We've had 3 network outages in the last 2 weeks, each related to a crash of one of our core switches. The crashes were caused by a firmware bug. We opened a high-priority case (P1) with the vendor of our networking equipment (Juniper) immediately after the first outage, and they are currently working on and testing a fix. They have confirmed that all 3 crashes are caused by the same firmware bug. At this time we cannot tell when we will receive a new firmware image with a fix for this issue. We certainly hope they will provide one as soon as possible, but it could take one or two more weeks before the fix is available to us.
In the meantime, we're exploring our options to limit the impact of further crashes as much as possible, since there is a high risk of more outages happening before we can actually implement the fix.
We will update this post as soon as more information becomes available.
We're terribly sorry for all the trouble.
++++ Update July 9 2013, 12:00 +++++
We've just ordered new hardware to build a brand new 10 GbE core layer based on Juniper EX4550 switches. This new multi-datacenter core layer, connected by our own redundant dark fiber links, should be rock solid. With this new layer in place, only a part of our network will be impacted when the firmware bug manifests itself. Meanwhile, we're still working on a fix for the firmware bug. Juniper is preparing and testing a custom firmware version with a fix in it.
++++ Update July 16 2013, 12:45 +++++
All hardware for the new core layer is in place, and we're busy finishing its configuration. In the night from Friday the 19th to Saturday the 20th of July, we're going to put the new layer into production. (If you're impacted by this change, you've received a maintenance notification.) After that, the problem with the old core switches will have less impact.
Meanwhile, we're still working on a solution to improve the stability of the old core switch, which will still serve a part of our network after the implementation of the new core layer. Juniper was able to reproduce the problem and is testing a fix for it. As soon as they are done testing it, we can deploy it on our impacted switches. When we have a release date for the custom firmware, we will inform you about it in this post.
++++ Update September 9 2013 +++++
Juniper has released a new firmware version, which we've deployed during an emergency maintenance window on the switches that were most impacted by this issue. So far the firmware has proven stable, and we're now moving forward with deploying it on our other switches as well.