Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera
Hi

I am helping to debug a Solr (4.7) deployment that shows >5.5GB of heap usage by IndexSchema. This Solr in particular has one collection with 64 shards (2 replicas, but 64 cores on one node). The schema has ~120 fields, ~20 of which are of the same field type (text_general), and the deployment serves around 700 concurrent users at peak, with a thread-pool limit of 1000.

They have already tried reducing the thread-pool size, but the load is high and the server keeps up with it fine at a pool of that size.

What surprised me is the obscene numbers they report in the heap: 680K (!!) TokenStreamComponents objects, each holding an 8KB buffer coming from StandardTokenizerImpl.zzBuffer. That surprised me because I thought a TokenStreamComponents can be (and is) reused for all fields in a document, so even with the TokenStreamComponents held in a ThreadLocal we should see at most 1000 of them per Analyzer. And as I said, the analyzed fields are of type text_general, while the rest of the fields are numeric, DV, String, Bool etc. (i.e. not analyzed).

Reviewing IndexSchema, it holds two analyzer instances: SolrIndexAnalyzer (extends DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy == PER_FIELD_REUSE_STRATEGY, which might explain the 680K objects in the heap:

64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it could be that they were serving fewer than 700 users when the heap dump was taken).

And if each such instance holds an 8KB zzBuffer, that amounts to >7GB of heap space!
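Just to make the back-of-the-envelope math concrete, here is a throwaway sketch (the 8KB per-instance figure is what the heap dump shows for zzBuffer; the other numbers are the deployment figures above):

  public class HeapEstimate {
    public static void main(String[] args) {
      long cores = 64, threads = 700, analyzedFields = 20;  // deployment figures above
      long bufferBytes = 8L * 1024;                         // zzBuffer per TokenStreamComponents
      long components = cores * threads * analyzedFields;   // 896,000 cached components
      double gb = components * bufferBytes / 1e9;           // ~7.3GB just for the buffers
      System.out.printf("components=%,d heap=%.1fGB%n", components, gb);
    }
  }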

Per Analyzer's constructor (which takes ReuseStrategy):

  /**
   * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
   * <p>
   * NOTE: if you just want to reuse on a per-field basis, it's easier to
   * use a subclass of {@link AnalyzerWrapper} such as
   * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
   * PerFieldAnalyerWrapper</a> instead.
   */

However, AnalyzerWrapper's documentation somewhat contradicts it (I think):

  /**
   * Creates a new AnalyzerWrapper with the given reuse strategy.
   * <p>If you want to wrap a single delegate Analyzer you can probably
   * reuse its strategy when instantiating this subclass:
   * {@code super(delegate.getReuseStrategy());}.
   * <p>If you choose different analyzers per field, use
   * {@link #PER_FIELD_REUSE_STRATEGY}.
   * @see #getReuseStrategy()
   */

Maybe it is correct for AnalyzerWrapper, but not for DelegatingAnalyzerWrapper?

From what I understand, we should be OK setting GLOBAL_REUSE_STRATEGY, since SolrIndexAnalyzer returns different Analyzers for different fields (per their field type). And all fields that share the same Analyzer instance should be safe reusing its TokenStreamComponents, since we never process fields in parallel?
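To make the difference concrete, here is a minimal, self-contained sketch against the Lucene 4.7 analysis API (the class and field names are mine, not Solr's). On a single thread, an Analyzer built with GLOBAL_REUSE_STRATEGY creates one TokenStreamComponents no matter how many fields it tokenizes, while PER_FIELD_REUSE_STRATEGY creates one per distinct field name - multiply that by threads, cores and analyzed fields and you get the numbers above:

  import java.io.IOException;
  import java.io.Reader;
  import java.io.StringReader;
  import java.util.concurrent.atomic.AtomicInteger;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  public class ReuseStrategyDemo {

    // Count how many TokenStreamComponents (and thus zzBuffer instances)
    // one Analyzer creates on a single thread for the given field names.
    static int componentsCreated(Analyzer.ReuseStrategy strategy, String[] fields) throws IOException {
      final AtomicInteger created = new AtomicInteger();
      Analyzer analyzer = new Analyzer(strategy) {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          created.incrementAndGet();
          return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_47, reader));
        }
      };
      for (String field : fields) {
        TokenStream ts = analyzer.tokenStream(field, new StringReader("some text"));
        ts.reset();
        while (ts.incrementToken()) { /* consume */ }
        ts.end();
        ts.close();
      }
      analyzer.close();
      return created.get();
    }

    public static void main(String[] args) throws IOException {
      String[] fields = { "title", "body", "summary" };
      System.out.println("GLOBAL:    " + componentsCreated(Analyzer.GLOBAL_REUSE_STRATEGY, fields));    // 1
      System.out.println("PER_FIELD: " + componentsCreated(Analyzer.PER_FIELD_REUSE_STRATEGY, fields)); // 3
    }
  }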

By the same logic, I also feel like PerFieldAnalyzerWrapper shouldn't pass PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances for different fields), but it's the one piece of the puzzle that confuses me, since I trust whoever wrote that class to understand this stuff better than I do ...

What do you think?

Shai

Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shalin Shekhar Mangar
Uwe fixed this in 4.10 with LUCENE-5803. Now we use
GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
create field types per node instead of per core for more savings.

--
Regards,
Shalin Shekhar Mangar.


Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera

Thanks Shalin, but I reviewed the code in trunk, and it still passes PER_FIELD. I can double check but I'm pretty sure that's what I saw.

Shai

Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera

One thing I also want to double-check: after I sent the email I thought about it some more, and I believe PER_FIELD is passed only as a fallback strategy; otherwise the wrapped analyzer's own strategy is used. Maybe in 4.7, before Uwe fixed things, some Analyzers still returned PER_FIELD, but they don't anymore. I will double-check that too.

Shai

RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Uwe Schindler

In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of AnalyzerWrapper.

 

DelegatingAnalyzerWrapper has its own ReuseStrategy. The per-field one here is used as a fallback only (for incompatible configurations, e.g. when one of the per-field configs wraps with a filter or charfilter - but this does not happen in Solr for fields). See the patch: https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch

 

It is very important to use the PER_FIELD one as the fallback strategy, because otherwise it would break stuff like AnalysisRequestHandler (because that one wraps). The Analyzer works per field, so any unknown delegate must be cached per field.

 

The idea of LUCENE-5803 is to also delegate the "caching": if the SolrAnalyzer does not wrap components, it can delegate the caching as well, and the wrapper's reuse strategy is then unused. The delegate - the FieldType's Analyzer - is a TokenizerChain, which uses GLOBAL_REUSE, so each FieldType caches globally, no matter how many field instances share it.
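
In other words (my paraphrase, with made-up class and field names, roughly following the shape of the LUCENE-5803 patch), the Solr schema analyzer looks something like this: the per-field fallback is only declared in the super() call, while the per-field-type delegates keep their own global, per-thread caching:

  import java.util.Map;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;

  final class SketchSchemaAnalyzer extends DelegatingAnalyzerWrapper {
    // Several fields of the same type share one Analyzer instance
    // (e.g. every text_general field maps to the same TokenizerChain).
    private final Map<String, Analyzer> analyzerPerField;
    private final Analyzer defaultAnalyzer;

    SketchSchemaAnalyzer(Map<String, Analyzer> analyzerPerField, Analyzer defaultAnalyzer) {
      super(PER_FIELD_REUSE_STRATEGY); // fallback only, used when components must be wrapped
      this.analyzerPerField = analyzerPerField;
      this.defaultAnalyzer = defaultAnalyzer;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
      Analyzer a = analyzerPerField.get(fieldName);
      return a != null ? a : defaultAnalyzer;
    }
  }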

 

So all is fine; it is just 4.7 where this optimization is not used.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [hidden email]

 


RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera

Ok thanks Uwe, this makes sense now!

Shai
