BlueOnyx Service Outage

07 Nov 2018 Posted by: mstauber Category: General

We're currently experiencing a server outage that affects part of the BlueOnyx infrastructure.

On 6th November at around 22:00 US Eastern time, one of our Aventurin{e} nodes went down and now refuses to start due to issues with either the disks or the RAID controller. That node was hosting 14 VPSs, six of which are critical to the BlueOnyx project.

Affected are the BlueOnyx Mailing-List, BlueOnyx email accounts, Solarspeed.net email accounts as well as parts of the YUM repositories.

We are currently working on restoring the VPSs from the daily backups, and the most critical services should come online again within the next couple of hours.

This news article will be updated while we work on the issue. Many thanks for your patience.

Update:

As of 02:00 US Eastern time all critical services should be up again. However, because we're overtaxing the hardware by running more active VPSs on fewer nodes, the quality of service might be impaired a little, at least until we can return the failed node to service, which should happen sometime during the morning hours.

2018-11-07 14:00 EST: Our provider Virtbiz.com was kind enough to set up a new Aventurin{e} 6109R for us, and we're now in the process of migrating all VPSs from the backup node over to it. As this involves moving around 800 GB of data, the move will take some time, and brief temporary service outages are expected.

2018-11-08 14:00 EST: The last VPSs (which were of lesser significance and not publicly exposed) have been restored from the backups and we're finally returning to normal operations.

There is still a day's worth of cleanup needed to reorganize our backup cycle a little, now that the replacement for the failed node runs Aventurin{e} 6109R, whereas the old node was using 6108R. This gives us the opportunity to implement a newer and more seamless "active" backup method, where our 6109R nodes do snapshotted/rsynced cross-backups. These can either be activated immediately on the backup node via "prlctl register", or be restored or moved to another node via "prlctl register" and "prlctl migrate" or via traditional rsync.
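To make the idea a little more concrete, here is a rough sketch of what such a cross-backup cycle could look like. It is not our actual tooling: the host name, the container ID, the /vz/private path layout and the helper function names are all assumptions made for illustration, and the snapshot step is left out. The script only builds and prints the command lines; it does not run them.

```python
import shlex

# Hypothetical default location of container private areas (an assumption).
PRIVATE_ROOT = "/vz/private"


def rsync_cmd(ctid, backup_host, private_root=PRIVATE_ROOT):
    """Build an rsync command that mirrors one container's private area
    to the same path on the backup node."""
    src = f"{private_root}/{ctid}/"
    dst = f"root@{backup_host}:{private_root}/{ctid}/"
    return ["rsync", "-a", "--numeric-ids", "--delete", src, dst]


def register_cmd(ctid, private_root=PRIVATE_ROOT):
    """Build the command that registers the synced copy on the backup node,
    so it can be started there."""
    return ["prlctl", "register", f"{private_root}/{ctid}"]


def migrate_cmd(ctid, target_host):
    """Build the command that moves a registered container to another node."""
    return ["prlctl", "migrate", ctid, target_host]


def backup_plan(ctid, backup_host):
    """Return the commands for one cross-backup cycle, in order:
    sync on the source node, then register on the backup node."""
    return [rsync_cmd(ctid, backup_host), register_cmd(ctid)]


if __name__ == "__main__":
    # Placeholder names, not real hosts or containers.
    for cmd in backup_plan("vps-example", "backup-node.example.com"):
        print(shlex.join(cmd))
    print(shlex.join(migrate_cmd("vps-example", "new-node.example.com")))
```

Keeping the commands as plain lists makes it easy to review them in a dry run before handing them to something like subprocess.run on the respective node.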

All in all, this server outage had all the ingredients of a proper disaster, but redundancies and a proven, reliable backup/restore mechanism mitigated almost all of the ill effects. I'd like to especially thank Chris Gebhardt and his crew at Virtbiz.com, who went way beyond the call of duty to get us new hardware and to help us back on our feet again. If you're looking for a datacenter or colocation facility, then that's the place to go. Many thanks!
