[solved] Problems with one of the SSD virtualisation segments

24/02/2016 - 15:50 - We are currently experiencing problems with 2 physical hosts in one of the SSD virtualisation segments. This could result in connectivity problems with your server. We're investigating what's going on; as soon as we know more we will update this post.

++++

24/02/2016 - 16:03 - 3 physical hosts went down within 1 minute of each other for unknown reasons. 2 servers are running again, and we're busy bringing up the 3rd physical server.

++++

24/02/2016 - 16:13 - 1 physical server is not booting. We are performing a hardware replacement right now. We will update this post as soon as the server is booting again.

++++

24/02/2016 - 16:24 - The last physical host just came up, and we are starting all virtual servers right now. We expect all virtual servers to be running again within 10 minutes.

++++

24/02/2016 - 16:30 - All virtual servers are running again. We're going to investigate the root cause of this incident.

++++

25/02/2016 - 15:30 - Please find the root cause analysis of this incident below.

About 1.5 hours before the incident we removed unused SSDs from multiple systems, including the 3 servers impacted by this incident. We remove disks from servers all the time for replacement or re-allocation; this should not cause any service impact and never has. However, we have reason to believe the crashes are somehow related to this maintenance and could indicate a bug in the firmware of the RAID controller that was somehow triggered by the removal of a disk.

Despite our comprehensive logging infrastructure, we've unfortunately been unable to find any hard evidence for this: not in the system logs of the hypervisors, on our syslog server, in the event logs of the RAID controllers, or in the IPMI interface.

We have checked the RAID controller firmware changelog and found no indication that this is a known issue or that a fix has been implemented. Since we swap and remove disks from systems with identical controllers all the time without any issues, it's impossible to determine whether this is really what caused the incident.

We've also checked and double-checked all possible external factors:

1) The server racks were closed and nobody was working near these servers at the time of the incident.

2) We are confident the incident was not caused by power issues. For starters, the servers didn't crash at the same time (there were several seconds in between). Neither our monitoring nor the monitoring of our data center shows any anomalies in power distribution (no dips or peaks). And even if there had been: we have 2 independent power feeds to all our equipment to ensure that power issues on one feed don't interrupt service.

3) We have pretty much excluded the possibility that customer VMs or malicious network packets caused these hosts to crash around the same time.

We will take extra precautionary measures to prevent incidents like this from happening again. 

We're terribly sorry for the trouble!
