Is the ‘Cloud’ More, or Less, Resilient Than Your Own Data Centre?
The latest outage in Microsoft Azure services, on 31st March, was in Japan and lasted over 7 hours until ‘most’ services were back online. This follows a similarly long Azure outage in 2014 that was eventually blamed by Microsoft on ‘human error’. The press release makes interesting reading and this month I will attempt to pick through the snippets of information and come up with a slightly more useful lesson-to-be-learned from the press release’s title: ‘UPS failure causes cooling outage’.
Of course, 7 hours of downtime in a year is only about 99.9% Availability – much lower than any end-user would accept from their own facility – and if you consider a ‘finance’ application then a failure once every couple of years, regardless of how long the outage lasted, would be beyond a disaster. This raises the interesting point that people choosing ‘cloud’ versus in-house either don’t seem to realise that ‘cloud’ is just someone else’s data centre, or they focus on a contract littered with SLAs and penalties and believe the salesman’s puff about the reliability attributes of ‘cloud’. Very few buyers of cloud services will ask to see the facility – and where would the salesman take them? It is a cloud, after all: floating, fluffy and nebulous… In fairness, MS Azure, on its current record, achieves nowhere near the Availability offered (and achieved) by most of the colocation providers, whilst most prospective cloud purchasers don’t have their own facility to compare anything with. The cost of colocation is certainly a lot less than building your own and, importantly, comes out of OpEx rather than CapEx.
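For the record, the arithmetic behind those ‘nines’ is trivial but worth seeing written down. Here is a minimal sketch (in Python, with figures that are purely illustrative and not taken from Microsoft’s release) of how a single 7-hour outage maps onto annual availability, and what the tighter availability classes actually allow:

```python
# Availability arithmetic: a single outage versus the usual 'nines'.
HOURS_PER_YEAR = 8760  # non-leap year

def availability(downtime_hours: float) -> float:
    """Fraction of the year the service was actually up."""
    return 1 - downtime_hours / HOURS_PER_YEAR

def downtime_budget_hours(avail: float) -> float:
    """Annual downtime allowed at a given availability, e.g. 0.9999."""
    return (1 - avail) * HOURS_PER_YEAR

print(f"7 h outage     -> {availability(7):.4%} availability")          # ~99.92%
print(f"99.99% target  -> {downtime_budget_hours(0.9999):.2f} h/year")  # ~0.88 h
print(f"99.999% target -> {downtime_budget_hours(0.99999) * 60:.1f} min/year")  # ~5.3 min
```

In other words, a single 7-hour event burns through roughly eight years’ worth of a ‘four nines’ downtime budget in one afternoon.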
So, what about this latest failure? Well, you can find one version of the press release here: https://data-economy.com/microsoft-azure-customers-hit-data-centre-outage/.
There is one salient point: the failure resulted from a loss of cooling, not a loss of voltage, and the cooling system was powered by the UPS system – a rare solution reserved for high-density applications. Now, no cooling system (none that I can think of) needs a UPS for ‘continuity’ of voltage in the way a server does (a 10 ms break and it is ‘goodnight Vienna’); UPS-backed cooling is only ever needed to avoid a rapid rise in server inlet temperature in high-density applications (>10 kW/cabinet) during the window when cooling stops on utility failure, before the generator picks up the load (10-15 s) and the cooling system regains full capacity (5-10 minutes, even with an old-technology chiller). In this case, where the cooling zone was off-load for hours, the UPS clearly wasn’t needed to ride the cooling system through, so switching it onto a utility feed might have taken 20 minutes once the problem was noticed. It appears that MS-Azure actually spotted the ‘loss of cooling capacity’ (affecting only a part of the data centre) from a remote location that was a couple of hours’ drive away.
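To put some rough numbers on why ride-through is the only reason to put cooling on UPS, here is a back-of-envelope sketch (Python again, with an assumed zone load and air volume of my own choosing, not figures from the incident) of how quickly recirculating IT exhaust heats a hall once the cooling stops:

```python
# Roughly how fast does a hall heat up when cooling stops but the IT load keeps running?
RHO_AIR = 1.2    # kg/m^3, density of air
CP_AIR = 1005.0  # J/(kg*K), specific heat of air

def temp_rise_k_per_s(it_load_kw: float, room_air_m3: float) -> float:
    """Approximate rate of rise, assuming all IT heat goes straight into the room air."""
    air_mass_kg = room_air_m3 * RHO_AIR
    return (it_load_kw * 1000.0) / (air_mass_kg * CP_AIR)

# Hypothetical 200 kW cooling zone with ~1000 m^3 of free air (assumed values):
rate = temp_rise_k_per_s(200.0, 1000.0)
print(f"~{rate:.2f} K/s, i.e. roughly {rate * 60:.0f} K in the first minute")
# A 10-15 s generator start is survivable; a 5-10 minute chiller restart is not,
# which is why high-density designs keep the fans/pumps (or the whole cooling
# plant) on UPS, or add thermal storage.
```

Even allowing for the crudeness of the model, the point stands: the UPS buys the cooling plant seconds to minutes of ride-through, nothing more – it is not there to keep the hall cool for hours.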
Then, for reasons that are not clear to me, they point out that the UPS that ‘failed’ was ‘rotary’ and specifically ‘RUPS’ – which isn’t a recognised term (it is either Hybrid Rotary or DRUPS, Diesel Rotary UPS) – but all types of UPS ‘fail’ by transferring the critical load to their automatic bypass. This slight mystery is compounded by the statement that the UPS was ‘designed for N+1 but running at N+2’. This would imply partial load in the facility and a slight disregard for UPS energy efficiency, as turning an unneeded UPS module ‘off’ would raise the load on the remaining system and save power – something particularly useful with rotary UPS, as partial-load efficiency is not a strong point. But the strange thing is that I don’t know of any UPS (type, topology or manufacturer) where one module in an N+1 redundant group trips off-line and doesn’t leave the rest of the load happily running at N – or, in this case, dropping from N+2 to N+1. Add to that the statement that only a part/zone of the cooling capacity dropped off-line.
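To make the redundancy point concrete, the small sketch below (my own illustration; the module count is hypothetical, since the press release gives no sizing) shows why losing one module from an N+2 group should leave the load entirely unaffected:

```python
# Redundancy arithmetic: with N modules needed to carry the load,
# losing one module from an N+2 group still leaves N+1.

def spare_modules(installed: int, needed: int, failed: int) -> int:
    """Redundant modules remaining after 'failed' modules trip off-line."""
    return installed - failed - needed

N = 4  # hypothetical: four modules carry the full load
print(spare_modules(N + 2, N, failed=1))  #  1 -> still N+1, load unaffected
print(spare_modules(N + 1, N, failed=1))  #  0 -> exactly N, load still carried
print(spare_modules(N + 1, N, failed=2))  # -1 -> genuine capacity shortfall
```

On that arithmetic, one tripped module in a group ‘running at N+2’ should be a non-event, which is precisely why the press release’s emphasis on the UPS is so odd.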
In fact, there is one ‘rotary’ solution that fits this scenario, and that is DRUPS with a dual-bus output: one ‘no-break’ bus feeding the critical load and one ‘short-break’ bus that supports the cooling load after a utility failure has occurred. Whilst the ‘short-break’ output is a single feed, each section of the cooling load is, assuming the system was designed properly, always dual-fed across two DRUPS machines and so should have simply transferred automatically to a healthy DRUPS machine in the remaining N+1 group.
But, so what? The press release clearly states that the site personnel (not MS-Azure but a 3rd-party facility management company) incorrectly followed an emergency procedure to regain cooling capacity and that ‘the procedure was wrong’. Then they had to wait for MS staff to arrive and fix the problem – something which, no doubt, involved switching circuits that had failed to switch automatically.
Could the local staff be described as ‘undertrained, unfamiliar and under-exercised’? And if so, whose fault is that? Certainly, the failure has little to do with a UPS of any type. The UPS event may have set off a chain of events, but it was what followed that turned a heart-racing 15-20 minute recovery procedure into a 7+ hour mini-disaster. Mentioning the UPS in the press release takes the eye off the underlying problem, and my view is that the root cause would appear to be, as usual, 100% human error – several human errors, in fact. The designer made it too complicated by having UPS-fed cooling that didn’t respond well to a ‘UPS going to bypass’ event. Someone wrote an emergency recovery procedure that had a mistake in it. Someone made the decision not to test the procedure(s) in anger, either at the commissioning stage or later. The local technicians were not allowed to simulate failures in a live production scenario and train on the process, so when the procedure failed they didn’t have the experience of the system to work around the problem. Human error. Latent failures, just like this example, are exacerbated by not testing the system in anger on a regular basis – regular testing is what keeps your technicians aware, agile and informed.
So, what about the question posed in the title? You have no way of telling, but as services are increasingly commoditised I would suggest that the answer will increasingly become ‘less’. Don’t forget what John Ruskin (1819-1900) said: ‘There is nothing in the world that some man cannot make a little worse and sell a little cheaper, and he who considers price only is that man's lawful prey’. Or, ‘you get what you pay for’, but my favourite Ruskin quote is ‘Quality is never an accident; it is always the result of intelligent effort’…