Disruptions to cloud hosting infrastructure


Update 7/5/2015

 

This is an update on the continued cloud disruptions Netregistry customers are facing. Our internal technical teams have worked in collaboration with the vendor, and we have identified the root cause of the issue.

Frustratingly, although the 'why' has been identified, we can't at this point determine the trigger point, which is the 'how'.

Now that we have confirmed the root cause, we are altering the architecture of our cluster to accommodate the limitation. This work will progress over the next 72 hours and offers a medium-term solution for platform stability while we evaluate a long-term resolution.

Tomorrow we are testing a medium-term solution, although at this stage our focus is on building and testing to the needed standard before committing to production.

Again, we know this has been a frustrating experience, and we appreciate the impact it has had on you and your business. It has been trying at this end as well, and we are optimistic that the current re-engineering will give us breathing space to work towards a long-term solution.

Update 6/5/2015


The current update is that we have good news and bad news.

The good news is that we have undertaken remediation work to mitigate the risk of the load balancers crashing. This work started at around lunchtime yesterday and was completed at around 4pm. Since then, the load balancers have stabilized and have operated normally over the last 18 or so hours.

The bad news is that the vendor has not yet delivered the hotfix for us to apply, which will hopefully address long-term stability. We are continuing to work with them, and they have sent an engineer onsite to work with our network team. As we get further updates on this work, including when we apply any patch that may impact customers, we'll proactively announce them via our normal channels.

We're optimistic that in the short term we'll see ongoing stability.


 
Original Notice
 
The Netregistry cloud hosting environment is currently undergoing severe disruptions. End users will experience this as hosted websites being 'down'.
 
We understand the fundamental importance of having your sites available online. We are feeling the pain ourselves because we host our core business sites on the same platforms that we offer to our customers and the vast majority of our business is done online. We get it and we are very sorry. We have committed all available resources to identifying and resolving the root cause of the issue. 
 
For those wanting a more technical explanation, please read on. Our cloud hosting platform is 'clustered', which means that many servers are responsible for serving webpages. Traffic to those webservers is managed by load balancers that run in active/passive mode. That is, there is a primary load balancer, with a redundant one available to fail over to.
 
There is currently an issue that causes our primary load balancer to fail. The nature of this failure then forces the secondary to restart rather than come online. 
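
For readers who find code easier to follow, the behaviour described above can be sketched roughly as follows. This is an illustrative toy model only, not our actual load balancer software or configuration; the names `LoadBalancer`, `State`, `fail_over` and `secondary_restarts` are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"          # currently handling traffic
    STANDBY = "standby"        # idle, ready to take over
    FAILED = "failed"          # crashed
    RESTARTING = "restarting"  # rebooting, not handling traffic


@dataclass
class LoadBalancer:
    name: str
    state: State


def fail_over(primary: LoadBalancer, secondary: LoadBalancer,
              secondary_restarts: bool = False) -> bool:
    """Simulate the primary failing; return True if traffic is still served."""
    primary.state = State.FAILED
    if secondary_restarts:
        # The fault described in this notice: the secondary restarts
        # instead of coming online, so nothing is handling traffic.
        secondary.state = State.RESTARTING
    else:
        # Intended active/passive behaviour: the standby is promoted.
        secondary.state = State.ACTIVE
    return secondary.state is State.ACTIVE


# Intended behaviour: sites stay up when the primary fails.
normal = fail_over(LoadBalancer("primary", State.ACTIVE),
                   LoadBalancer("secondary", State.STANDBY))

# Behaviour during this incident: both units are out of service at once.
incident = fail_over(LoadBalancer("primary", State.ACTIVE),
                     LoadBalancer("secondary", State.STANDBY),
                     secondary_restarts=True)
```

In the first case the secondary takes over and websites remain reachable; in the second, neither load balancer is passing traffic until one of them recovers, which is what customers see as sites being 'down'.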
 
We are actively working with the vendor to resolve the issue, as it's something neither we nor the vendor have seen in a production environment previously. A hotfix was supplied last night; it was applied, failed immediately, and was rolled back just after midnight.
 
We are presently working on a short-term fix internally to potentially mitigate immediate issues, or at least limit the impact on the customer base. We have an ETA of today for a new hotfix to be tested and supplied, which we will install overnight to try to fix the root cause of this problem.
 
We know we've let you down. We unreservedly apologize and are doing all that we can to fix it as soon as possible. 
 
If you have questions or comments, please get in touch. We are listening at feedback@netregistry.com.au
 
Regards,
Brett Fenton
Chief Customer Officer 
MelbourneIT Group Ltd. 