NVMe - SSD shredding due to Lucene :-)


NVMe - SSD shredding due to Lucene :-)

Uwe Schindler
Hi all,

I just wanted to inform you that I asked the provider of the Policeman Jenkins Server to replace the first of two NVMe SSDs, because it failed with fatal warnings due to too many writes and no more spare sectors:

> root@serv1 ~ # nvme smart-log /dev/nvme0
> Smart Log for NVME device:nvme0 namespace-id:ffffffff
> critical_warning                    : 0x1
> temperature                         : 76 C
> available_spare                     : 2%
> available_spare_threshold           : 10%
> percentage_used                     : 67%
> data_units_read                     : 62,129,054
> data_units_written                  : 648,788,135
> host_read_commands                  : 6,426,997,226
> host_write_commands                 : 5,582,107,803
> controller_busy_time                : 86,754
> power_cycles                        : 21
> power_on_hours                      : 20,252
> unsafe_shutdowns                    : 16
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 7855
> Critical Composite Temperature Time : 0
> Temperature Sensor 1                : 76 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0

The second one looks a bit better, but will be changed later, too. I have no idea what a data unit is (512 bytes, 2048 bytes,... - I think one LBA).
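
Assuming a data unit is what the NVMe spec describes (the counters are reported in thousands of 512-byte units; that is an assumption, I did not verify it for this drive), a quick back-of-the-envelope conversion looks like this:

// Rough conversion of the smart-log counters above. The 512 * 1000 factor
// is the NVMe-spec definition of a "data unit"; whether this drive reports
// it that way is an assumption.
public class NvmeWriteEstimate {
  public static void main(String[] args) {
    long dataUnitsWritten = 648_788_135L; // from nvme smart-log
    long powerOnHours = 20_252L;          // from nvme smart-log
    double bytesWritten = dataUnitsWritten * 512.0 * 1000.0;
    double terabytes = bytesWritten / 1e12;
    double years = powerOnHours / (24.0 * 365.0);
    System.out.printf("~%.0f TB written over ~%.1f years (~%.0f TB/year)%n",
        terabytes, years, terabytes / years);
  }
}

If that unit assumption holds, this comes out to roughly 330 TB written over about 2.3 power-on years, somewhere around 140 TB per year.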

So we are really shredding SSDs with Lucene tests 😊

Uwe

P.S.: The replacement is currently going on...
-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: [hidden email]





Re: NVMe - SSD shredding due to Lucene :-)

caomanhdat
Thanks Uwe for keeping the Police up and running!

--
Best regards,
Cao Mạnh Đạt
E-mail: [hidden email]

Re: NVMe - SSD shredding due to Lucene :-)

Michael McCandless-2
In reply to this post by Uwe Schindler
Nice to know :) Thanks for upgrading, Uwe.

I thought we randomly disable fsync in tests just to protect our precious SSDs?



RE: NVMe - SSD shredding due to Lucene :-)

Uwe Schindler

Hi,

 

The service to replace those SSDs is included in the rental fee 😊

 

I am not sure why it writes so much, but I think it is Solr that is hammering our SSDs; Lucene's tests do not do that much IO. Nevertheless, the SSD survived more than 2 years. The server was installed on 2017-05-19. After some runtime I calculated the approximate lifetime, and my estimate was not bad: I said 2 years 😊
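
(Reading the smart output above again, and assuming I interpret the NVMe fields correctly: critical_warning 0x1 is the "available spare below threshold" bit, and available_spare is at 2% against a 10% threshold while percentage_used is only 67%, so the spare blocks gave out before the rated write endurance did.)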

 

FYI, at the moment they are replacing disk #2 (I rebuilt the RAID array beforehand).

 

Uwe

 

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: [hidden email]

 



Re: NVMe - SSD shredding due to Lucene :-)

Michael McCandless-2
SSD vendors should use our tests for QA'ing their new SSDs!



RE: NVMe - SSD shredding due to Lucene :-)

Uwe Schindler

Hi,

 

NVMe SSD #2 has also been replaced. Both are of course "recycled ones": that's how data centers work (if people no longer use a server and cancel the rental agreement, its SSDs are recycled and reused as spare parts, unless their SMART status is bad). But the remaining lifetime is still good for at least 1.5 years.

 

Uwe

 

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: [hidden email]

 



RE: NVMe - SSD shredding due to Lucene :-)

Uwe Schindler
In reply to this post by Michael McCandless-2

Hi Mike,

 

you are right, we have the special NIO.2 filesystem that makes fsync a no-op in 90% of all cases. This works fine for Lucene, but Solr does not use the virtual filesystem: it just copies the path of the temp directory as a string and passes it to the default directory factory through its solrconfig.xml file, so there is no way to capture fsyncs; Solr goes through the plain default filesystem.

 

We should work on a solution for this, as it may speed up tests dramatically.
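
Just to illustrate the idea (this is not the actual test-framework code, which works at the NIO.2 FileSystem level, and NoFsyncDirectory is a made-up name for this sketch): the same effect can be expressed at the Directory level with a delegating wrapper whose sync methods do nothing:

import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;

// Sketch only: a delegating Directory that swallows sync calls, so nothing
// written through it is ever explicitly forced to the device.
public final class NoFsyncDirectory extends FilterDirectory {

  public NoFsyncDirectory(Directory in) {
    super(in);
  }

  @Override
  public void sync(Collection<String> names) throws IOException {
    // intentionally a no-op: skip fsync of the listed files
  }

  @Override
  public void syncMetaData() throws IOException {
    // intentionally a no-op: skip fsync of the directory metadata
  }
}

Something like this would only help if Solr's directory factory could be told to wrap the directories it hands out, which is exactly the part that is missing today.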

 

In the meantime I did "apt install eatmydata" (http://manpages.ubuntu.com/manpages/bionic/man1/eatmydata.1.html). This makes it easy to hide all fsyncs. We can just add this to the Jenkins config for new jobs via the job environment plugin, so that Jenkins jobs don't fsync:

 

LD_PRELOAD=libeatmydata.so
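
(If I remember the man page correctly, the package also installs an eatmydata wrapper command, so "eatmydata <command>" gives the same effect for a one-off run; under the hood it is exactly this LD_PRELOAD trick, turning fsync/fdatasync/sync/msync and O_SYNC opens into no-ops.)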

 

This trick may be interesting for others, too. Steve Rowe?

 

To test the difference, I will now run the Jenkins server for a day, measure the number of reads/writes from the SMART output, and then enable this for the Linux jobs (it's easy to do in the Groovy file that selects the random JVM).

 

The VMs for Windows, Mac, Solaris have the virtual disk already configured to ignore any device syncs.

 

Uwe

 

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: [hidden email]

 
