File Descriptor/Memory Leak

File Descriptor/Memory Leak

Mads Tomasgård Bjørgan
Hello there,
Our SolrCloud is experiencing an FD leak while running with SSL. It occurs on the one machine that our program sends data to. We have a total of three servers running as an ensemble.

When running without SSL, the FD count remains fairly constant at around 180 while indexing. Performing a garbage collection also clears almost the entire JVM memory.

However, when indexing with SSL, the FD count grows polynomially. It increases by a few hundred every five seconds or so, and easily reaches 50,000 within three to four minutes. Performing a GC frees most of the memory on the two machines our program isn't transmitting the data directly to. The last machine is unaffected by the GC, and neither memory nor the FD count resets until Solr is restarted on that machine.
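
For anyone who wants to track the same number, here is a minimal Python sketch of sampling a process's descriptor count on Linux. It assumes the /proc/<pid>/fd layout; the function names, the proc_root parameter and the default interval are illustrative choices and not part of Solr.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors of a process (Linux only).

    Each entry in <proc_root>/<pid>/fd corresponds to one open descriptor.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def growth_per_second(samples):
    """Average FD growth per second between the first and last sample.

    `samples` is a list of (timestamp_seconds, fd_count) tuples.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0) if t1 > t0 else 0.0


def sample_fds(pid, interval=5.0, rounds=3, proc_root="/proc"):
    """Take `rounds` samples of the FD count, `interval` seconds apart."""
    samples = []
    for i in range(rounds):
        samples.append((time.time(), count_open_fds(pid, proc_root)))
        if i < rounds - 1:
            time.sleep(interval)
    return samples
```

Calling growth_per_second(sample_fds(solr_pid)) with the PID of the Solr JVM would give a rough rate; reading another user's /proc entry usually requires running as that user or as root.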

Running netstat reveals that the FDs mostly consist of TCP connections in the "CLOSE_WAIT" state.
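
A breakdown like this can be produced with a few lines of Python. The sketch below assumes the column layout printed by Linux `netstat -tan` (protocol first, state in the sixth column); the function name is made up for illustration.

```python
from collections import Counter

# State names as printed by the Linux net-tools netstat.
TCP_STATES = {
    "ESTABLISHED", "SYN_SENT", "SYN_RECV", "FIN_WAIT1", "FIN_WAIT2",
    "TIME_WAIT", "CLOSE", "CLOSE_WAIT", "LAST_ACK", "LISTEN", "CLOSING",
}


def count_tcp_states(netstat_output):
    """Count TCP connection states in the text printed by `netstat -tan`."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with "tcp" or "tcp6"; header rows are skipped.
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] in TCP_STATES):
            counts[fields[5]] += 1
    return counts
```

On a live machine the input could come from subprocess.run(["netstat", "-tan"], capture_output=True, text=True).stdout, and count_tcp_states(...)["CLOSE_WAIT"] then gives the figure discussed here.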


Re: File Descriptor/Memory Leak

Shalin Shekhar Mangar
I have seen this CLOSE_WAIT issue myself at a customer site. I am running some
tests with different versions to pinpoint the cause of this leak.
Once I have more information and a reproducible test, I'll open a JIRA
issue. I'll keep you posted.

On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan <[hidden email]> wrote:


--
Regards,
Shalin Shekhar Mangar.

Re: File Descriptor/Memory Leak

Shai Erera
Shalin, we're seeing that issue too (and actually actively debugging it
these days). So far I can confirm the following (on a 2-node cluster):

1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
2) It does not reproduce when SSL is disabled
3) After restarting the Solr process (sometimes both need to be restarted), the
count drops to 0, but if indexing continues, it climbs again

When it does happen, Solr seems stuck. The leader cannot talk to the
replica, or vice versa; the replica is usually put in DOWN state and
there's no way to fix it besides restarting the JVM.

Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that
looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look
legit. That did not help, and honestly I did that before we suspected
it might be the SSL. Therefore I think those are "safe", but just FYI.

When it does happen, the number of CLOSE_WAITs climbs very high, to the
order of 30K+ entries in 'netstat'.

When I say it does not reproduce on 5.4.1, I really mean the numbers don't
go as high as they do in 5.5.1. That is, when running without SSL, the
number of CLOSE_WAITs is smallish, usually fewer than 10 (I would
separately like to understand why we have any in that state at all). When
running with SSL on 5.4.1, they stay low, on the order of hundreds at most.

Unfortunately running without SSL is not an option for us. We will likely
roll back to 5.4.1, even if the problem exists there, but to a lesser
degree.

I will post back here when/if we have more info about this.

Shai

On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar <[hidden email]>
wrote:


Re: File Descriptor/Memory Leak

Anshum Gupta
I've created a JIRA to track this:
https://issues.apache.org/jira/browse/SOLR-9290

On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera <[hidden email]> wrote:




--
Anshum Gupta

RE: File Descriptor/Memory Leak

Mads Tomasgård Bjørgan
FYI - we're using Solr-6.1.0, and the leak seems to be consistent (it occurs every single time when running with SSL).

-----Original Message-----
From: Anshum Gupta [mailto:[hidden email]]
Sent: Thursday, July 7, 2016 18:14
To: [hidden email]
Subject: Re: File Descriptor/Memory Leak


RE: File Descriptor/Memory Leak

Alexandre Rafalovitch
Is there a firewall between a client and a server by any chance?

CLOSE_WAIT is not a leak in itself, but a standard step at the end of a TCP
connection. So the question is why sockets are reopened that often, or why
the other side does not acknowledge the TCP termination packet quickly.

I would run Ethereal to troubleshoot that, along with truss/strace.
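
As background, a socket sits in CLOSE_WAIT from the moment the peer's FIN arrives until the local application closes its own end. The Python sketch below illustrates that sequence on the loopback interface; it is a generic TCP demonstration with a made-up function name, not Solr or Jetty code.

```python
import socket


def peer_close_demo():
    """Show what the local side sees when the peer closes first.

    Returns the bytes read after the peer has closed. An empty bytes object
    signals end-of-stream, i.e. the peer's FIN has arrived.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    accepted, _ = listener.accept()

    client.close()              # the peer closes first and sends its FIN
    data = accepted.recv(1024)  # blocks until the FIN arrives, returns b""

    # From the arrival of the FIN until accepted.close() below, the kernel
    # keeps the accepted socket in CLOSE_WAIT and its descriptor allocated.
    accepted.close()            # only now can the connection finish closing
    listener.close()
    return data
```

If the application never reaches the equivalent of accepted.close(), the socket and its file descriptor stay in CLOSE_WAIT indefinitely, which is why large counts of that state usually point at the side holding the sockets.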

Regards,
    Alex
On 8 Jul 2016 4:56 PM, "Mads Tomasgård Bjørgan" <[hidden email]> wrote:


Re: File Descriptor/Memory Leak

Shai Erera
There is no firewall and the CLOSE_WAITs are between Solr-to-Solr nodes
(the origin and destination IP:PORT belong to Solr).

Also, note that the same test runs fine on 5.4.1, even though there are
still a few hundred CLOSE_WAITs. I'm looking at what has changed in the
code between 5.4.1 and 5.5.1. It's also only reproducible when Solr is run
in SSL mode, so the problem might lie in HttpClient/Jetty too.
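
To see which end of a Solr-to-Solr connection is holding the half-closed sockets, the netstat rows can be grouped by remote host and by which side owns the Solr port. This is a rough Python sketch assuming Linux `netstat -tan` output and the default Solr port 8983; the function name and the role labels are invented for illustration.

```python
from collections import Counter


def close_wait_by_peer(netstat_output, solr_port="8983"):
    """Count CLOSE_WAIT sockets per (remote host, role) pair.

    role is "server" when the local port is the Solr port (connection
    accepted by this node), "client" when the remote port is the Solr port
    (connection opened by this node), and "other" otherwise.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "CLOSE_WAIT":
            continue
        # rpartition splits on the last ':' so IPv6 addresses stay intact.
        local_port = fields[3].rpartition(":")[2]
        remote_host, _, remote_port = fields[4].rpartition(":")
        if local_port == solr_port:
            role = "server"
        elif remote_port == solr_port:
            role = "client"
        else:
            role = "other"
        counts[(remote_host, role)] += 1
    return counts
```

A count dominated by "client" entries would suggest the outgoing HTTP client side is not closing its connections, while "server" entries would point at the accepting side.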

Shai

On Fri, Jul 8, 2016 at 11:59 AM Alexandre Rafalovitch <[hidden email]>
wrote:


Re: File Descriptor/Memory Leak

Alexandre Rafalovitch
If this is reproducible, I would run the comparison under Wireshark
(formerly called Ethereal) https://www.wireshark.org/ . It can
capture the full network traffic and can even be run on a machine separate
from both client and server (in promiscuous mode).

Then, I would look at the difference in the number of connections between
HTTP and HTTPS for the same test. Perhaps HTTP is doing request pipelining
and HTTPS is not. This would lead to more sockets (and more
CLOSE_WAITs) for the same content.

If the number of connections is the same, then I would pick a similar
transaction and look at the delays between the packets of the closing
sequence (FIN/ACK/whatever). If, after the server sends the closing
packet, the client does not reply as fast with its own closing packet
under HTTPS, then the problem is in the socket-closing code. Obviously,
establishing an SSL connection is more painful/expensive than
establishing a non-SSL one, but the issue here is closing one.
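
Measuring those closing delays could be scripted once the FIN packets are exported from a capture. The Python sketch below assumes a hypothetical record layout of (stream_id, sender, timestamp) per FIN packet; how those records are extracted from the capture is left open, and the function name is illustrative.

```python
def fin_reply_delays(fin_events):
    """Delay between the first FIN and the answering FIN, per connection.

    `fin_events` is an iterable of (stream_id, sender, timestamp) tuples,
    one per FIN packet. Returns {stream_id: seconds}, with None for
    connections where only one side has sent a FIN so far.
    """
    first = {}   # stream_id -> (sender, timestamp) of the first FIN seen
    delays = {}
    for stream_id, sender, ts in sorted(fin_events, key=lambda e: e[2]):
        if stream_id not in first:
            first[stream_id] = (sender, ts)
            delays[stream_id] = None
        elif delays[stream_id] is None and sender != first[stream_id][0]:
            # First FIN from the opposite side: the connection is closing.
            delays[stream_id] = ts - first[stream_id][1]
    return delays
```

Connections that map to None are the ones where one side never answered with its own FIN, so they are the candidates for sockets stuck in CLOSE_WAIT; comparing the delays from an HTTP run with those from an HTTPS run would show whether closing really is slower under SSL.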

This is how I troubleshot these scenarios many years ago as
Weblogic senior tech support. I still think approaching this from
the network up is the most viable approach.

Regards,
   Alex.

----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 July 2016 at 17:05, Shai Erera <[hidden email]> wrote:
