Information regarding recent issues
This is an update regarding the incident that occurred in the Leeds datacentre where we host our shared and VPS websites. On Wednesday the 10th of February, emergency maintenance was taking place on a load transfer module. This module feeds power from external energy supplies to the datacentre. Two dual uninterruptible power supplies feed the datacentre, and these are backed up by generators in case of a National Grid failure.
A safety mechanism within the module was triggered incorrectly, causing a power outage of less than nine minutes. This resulted in a proportion of servers being hard rebooted.
This has never happened before; it is an incredibly rare occurrence and, aside from a fire, the worst thing that could happen at the datacentre. A full investigation into how the equipment could wrongly trip the safety mechanism will be carried out.
How were the servers affected by the reboot?
During the reboot, some NAS drives on the shared hosting platform crashed and could not be recovered. The NAS drives are set up in fully redundant pairs, each containing an 8+ disk RAID 10 array. In all cases but one, at least one server in each pair came back cleanly or in a repairable state, and websites were up and running within 2-3 hours.
However, web clusters 75 - 79 did not recover: their NAS drives failed to restore. Further attempts were made to restore these drives while new ones were built in parallel in case they were needed. Following a false indication that these servers could be restored in a functioning state, further attempts were made to repair the file system.
Unfortunately, following the repair it was clear that the file system damage was causing the servers to run extremely slowly. The only option was to copy backups onto new NAS drives which, with the amount of data involved, is a very lengthy process. It was now apparent that web clusters 75 - 79 would take days rather than hours to restore. These servers were restored in alphabetical order and were all up and running (read only) by Sunday evening.
A full restore of a shared cluster was a critical incident for the datacentre. The priority was to restore the servers safely and as quickly as possible. Investigations are underway into splitting the platform and infrastructure servers across two datacentre halls, which would enable the datacentre to continue running if one hall lost power.
The VPS suffered more damage than the shared servers because they are unmanaged and therefore not backed up by us. One issue was that some of the VPS used a certain type of caching that made them more susceptible to data corruption in the event of power loss. This has now been rectified for all current and future VPS.
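The report does not name the caching mode involved, but on KVM platforms a common culprit is writeback disk caching, which acknowledges guest writes before they have reached stable storage. A minimal sketch of a safer setting in a libvirt domain definition, assuming qcow2-backed guests (the disk path and device names here are illustrative, not the datacentre's actual configuration):

```xml
<!-- Illustrative libvirt <disk> element. cache='none' bypasses the host
     page cache, so guest writes reach the storage layer directly and the
     window for corruption on sudden power loss is much smaller. -->
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/libvirt/images/example-vps.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

cache='none' still relies on the guest issuing flushes correctly; cache='writethrough' is more conservative still, at a performance cost.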
Due to the power failure, two KVM hosts were lost; these are the host servers on which the VPS run. However, the VPS data itself was not damaged. In addition, two KVM switches needed swapping, which caused intermittent issues on other VPS during this time.
To bring the VPS back online, the KVM hosts had to be rebuilt and the VPS data copied over. Some VPS were up and running within a few hours, but in most cases file systems and databases had been damaged. We advised that the quickest way to restore a VPS was a rebuild and restore, if the customer had backups.
Some customers did not have backups for their VPS. For these, the datacentre admins ran a file system check on individual servers to try to get them back online. This would not fix any MySQL issues, so a guide was created showing users how to repair MySQL.
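For reference, a repair of this kind typically looks something like the following shell sketch. This is an illustration under assumed device names, not the datacentre's exact procedure; fsck must only be run on an unmounted file system, and you should copy off whatever data you can first:

```shell
# Check and repair the file system while it is unmounted.
# /dev/vda1 is an illustrative device name; substitute your own.
fsck -y /dev/vda1

# Once the server boots and MySQL is running again, check every table
# and repair any found corrupt. --auto-repair works for MyISAM tables;
# a damaged InnoDB instance instead needs the innodb_force_recovery
# setting and a dump/reload.
mysqlcheck --all-databases --check --auto-repair
```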
We offered to have the datacentre admins attempt the repair. However, all such requests were queued due to the backlog of servers needing repair. We believe that all customers who contacted us regarding their VPS now have them repaired and in a working state. If this is not the case, please open a ticket with us.
As VPS are unmanaged, there is no disaster recovery procedure in place. However, the VPS hosts are now configured to be considerably more resilient.
Support and Updates
Once the power outage occurred, we endeavoured to assist and update customers as much as possible. Because our own site and hosting status page were down for a number of hours, we posted regular updates on Facebook and Twitter. We did this more on Facebook, as it allows for more characters and fuller explanations.
We replied to every Facebook and Twitter communication that we received and continuously updated these feeds in the following days. Once email communications, our status page and the ticket system were up and running again, we tried our best to update and answer all queries as quickly and efficiently as possible.
This event was unprecedented, so replies to your queries and resolution of your tickets will have taken longer than usual. Most tickets involved in-depth investigation or required datacentre admins to look into them. Given the length of the queues for datacentre admin assistance, some queries took longer to resolve than we would have liked, and some issues are still ongoing.
We are genuinely, incredibly sorry that this event occurred and thank you for your understanding during this time. The absolute last thing we would ever want to do is let our customers down and we totally understand your frustration during this time.
We can only hope that we can regain your trust. We will always endeavour to keep you in the loop and offer the best service possible. We hope that the further investigations into this incident will help the datacentre provide a reliable, uninterrupted service in the future.
Thank you for reading this. If you still have ongoing tickets regarding this issue, please be assured that we are working through your queries as quickly as possible. If you have a problem and haven't opened a ticket yet, please do.