Is Resilience Misrepresented, as Well as Misunderstood?

Jan 26, 2018

When it comes to data centres, the word ‘resilience’ is best defined as ‘the ability to maintain ICT service in the face of environmental extremes as well as human error or deliberate sabotage’ and, generally, higher levels of resilience can be engineered into the mechanical and electrical infrastructure at a cost premium. ‘Human error’ is well documented to be the root cause of 70% of all data centre ‘failures’, but even that can be reduced by design: for example, a dual-bus power system with a UPS in each bus can largely protect a correctly connected dual-corded load against power failure, human error and inept sabotage. You will probably notice how careful I am with the caveats…

Of course, if you are a client/user of a data centre you clearly want to know what you are getting for your money, not least so that you can pay for what you deserve. But, in the wonderful words of John Ruskin (1819-1900), ‘There is nothing in the world that some man cannot make a little worse and sell a little cheaper, and he who considers price only is that man's lawful prey’. In modern parlance: ‘if you pay the lowest price you are usually buying rubbish’.

So, how to differentiate between systems? Well, we have two ‘metrics’, somewhat interlinked and both abused:

The ‘Tiers’ of Uptime (I-IV), the ‘Types’ of TIA-942 (I-IV), the ‘Rating’ of BICSI (0-4, although ‘0’ doesn’t describe a data centre, so 1-4) and the ‘Availability Class’ of EN50600
Availability percentage, e.g. 99.999% (the so called ‘five-nines’)

Apart from pointing out that the Uptime rules are no longer written down for public consumption, that TIA-942 & BICSI are ANSI Standards most applicable in North America, and that EN50600 isn’t yet used much, we can distil them all into four levels describing the capability for ‘concurrent maintainability’ & ‘fault tolerance’. The principles are clear: concurrent maintainability answers the question ‘what is the point of building a hugely reliable (and maybe resilient) data centre that must be shut down once a year for maintenance?’, whilst a fault-tolerant system can have any component, path or space ‘fail’ (one at a time) without impacting the ICT service.

But the greatest abuse is reserved for Availability percentage: easy to calculate but capable of huge misinterpretation to fool the unwary. The first problem is that to state an Availability you need just two numbers, the MTBF (mean time between failures, in hours) and the MTTR (mean time to repair, in hours), and you simply express the Availability by dividing the MTBF by the total time (MTBF+MTTR) and multiplying by 100%. So, having a very long MTBF and a very short MTTR gives you an incredibly high result. Unfortunately, both MTBF & MTTR are numbers that marketing departments can guess at, if they use them at all. For example, you can quote 99.999% for a UPS simply by assuming that the client has the skills and spare parts on site and can repair it himself in 20 minutes, instead of calling the service engineer, waiting for spare parts and then re-testing before putting it back into service (often one day or longer).
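
To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. The 50,000-hour MTBF is purely an assumed figure for illustration (it does not come from any data sheet), and the function name is my own.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability (%) = MTBF / (MTBF + MTTR) * 100."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


# Assumption for illustration only: the same UPS, the same MTBF.
ASSUMED_MTBF_HOURS = 50_000.0

# Marketing view: the client repairs it himself in 20 minutes.
optimistic = availability_percent(ASSUMED_MTBF_HOURS, 20 / 60)

# Service view: engineer call-out, spare parts and re-test take a day.
realistic = availability_percent(ASSUMED_MTBF_HOURS, 24.0)

print(f"20-minute repair: {optimistic:.5f}%")
print(f"one-day repair:   {realistic:.5f}%")
```

With the same assumed MTBF, the 20-minute repair clears the ‘five-nines’ threshold whilst the one-day repair falls short of it. Nothing about the hardware has changed, only the assumption.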

The second problem is a combination of the number of failure events (summing multiple MTTRs) and the MTBF. The original Uptime white paper (now withdrawn) attempted to link Availability % with the four Tiers but didn’t define the period over which it would be measured. This led to the strange scenario where a low-Tier facility would offer to be off-line for 53 minutes per year but the ultimate ‘IV’ would offer only 5.3 minutes. How bizarre was that? A failure once a year is a disaster, for any ‘Tier’.
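
As a rough check, the two downtime figures above are what 99.99% and 99.999% work out to if you take the measurement period to be one year. The sketch below shows that sum; the one-year period (365 days) is my assumption, precisely because the white paper did not state one, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_years: float = 1.0) -> float:
    """Total downtime (minutes) permitted by an availability over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR * period_years


four_nines_year = allowed_downtime_minutes(99.99)         # one-year period
five_nines_year = allowed_downtime_minutes(99.999)        # one-year period
four_nines_decade = allowed_downtime_minutes(99.99, 10)   # ten-year period

print(f"99.99%  over 1 year:   {four_nines_year:.2f} min")
print(f"99.999% over 1 year:   {five_nines_year:.3f} min")
print(f"99.99%  over 10 years: {four_nines_decade:.1f} min")
```

The same percentage permits ten times the downtime when quoted over ten years, which is why a percentage without a period tells you very little.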

Anyway, let’s not dwell on that but consider the combination problem, which particularly affects numerous very short-lived failures. The easiest way to illustrate it is to suggest that your heart is 99.9% ‘available’. That doesn’t sound too bad until you consider that it represents some 36,000 missed heartbeats a year: if they are missed in one session you are very dead, whilst if they are evenly spread over the year you are just feeling unwell. In data centre terms, look at the voltage supplied to the load. Many modern servers cannot withstand a break in supply for longer than 10 ms (milliseconds), and some considerably less, so offering a 99.9999999% Availability in the power system (9-nines) could still produce three failures every year, each lasting 10 ms.
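
The same sum can be turned around to count how many short breaks fit inside a given number of nines. A minimal sketch, assuming a 365-day year; the function names are mine and purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def unavailable_seconds_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 -> 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_YEAR * 10.0 ** -nines


def breaks_per_year(nines: int, break_seconds: float) -> float:
    """How many breaks of a given length fit in that yearly budget."""
    if break_seconds <= 0:
        raise ValueError("break length must be positive")
    return unavailable_seconds_per_year(nines) / break_seconds


# Nine-nines power system, each break lasting 10 ms.
nine_nines_breaks = breaks_per_year(9, 0.010)

# Three-nines 'heart', expressed as hours of downtime per year.
three_nines_hours = unavailable_seconds_per_year(3) / 3600

print(f"9-nines allows about {nine_nines_breaks:.1f} breaks of 10 ms per year")
print(f"3-nines allows about {three_nines_hours:.2f} hours of downtime per year")
```

Nine-nines leaves roughly 31.5 ms a year, which is just over three 10 ms breaks, whilst three-nines leaves about 8.76 hours. Whether that budget arrives in one lump or in thousands of slivers is exactly what the percentage hides.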

So what to do? Well, there is nothing wrong with Availability as a metric as long as it is clear what it is based upon. For example, ‘an Availability of 99.99% measured over 10 years with a single failure lasting no longer than 10 hours’ is a clear statement of MTBF (10 years) and MTTR (10 hours). OK, the marketing boys and girls may have rounded the answer up from 99.98859…% but you may, by now, be getting the point that the MTBF is more important than the Availability and, to boot, you need the MTBF to calculate the Availability in the first place. The ‘single failure’ caveat avoids summation of multiple events.
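
The quoted example can be checked in both directions: forwards from MTBF and MTTR to Availability, and backwards from a target Availability to the MTBF it implies. A minimal sketch, assuming 365-day years; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760, ignoring leap years


def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability (%) = MTBF / (MTBF + MTTR) * 100."""
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


def mtbf_hours_needed(target_percent: float, mttr_hours: float) -> float:
    """Rearranged formula: MTBF = A * MTTR / (1 - A), with A as a fraction."""
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target must be strictly between 0 and 100")
    a = target_percent / 100.0
    return a * mttr_hours / (1.0 - a)


# Forwards: MTBF of 10 years, single failure of 10 hours.
stated = availability_percent(10 * HOURS_PER_YEAR, 10.0)

# Backwards: MTBF needed for an exact 99.99% with a 10-hour repair.
needed_years = mtbf_hours_needed(99.99, 10.0) / HOURS_PER_YEAR

print(f"10 years / 10 hours gives {stated:.5f}%, quoted as {stated:.2f}%")
print(f"an exact 99.99% needs an MTBF of about {needed_years:.1f} years")
```

So a true 99.99% with a 10-hour repair would need a little over 11 years between failures rather than 10, which is why the quoted figure is a rounding.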

The next time someone offers you 99.999% of anything just ask them ‘over what period’ and watch their expression change – it can be fun.

Of course, the ultimate ‘failure’ of a resilient data centre is the easiest to achieve: it is not hacking into the UPS and turning off the power or (as in a recent movie) raising the server inlet temperature to cause a melt-down. No, just consider the definition of a data centre: a facility housing compute, storage and I/O connectivity, right? So, walk round the outer perimeter of the property noting the location of the fibre pits and return later that night with a few chums, each armed with a balaclava, a few gallons of unleaded and a box of matches. Grenades would be better but my local garage doesn’t sell them. Whip up the (unlocked) cast-iron pit lids and within seconds you are fleeing the scene and the data centre is disabled for several days. The same principle applies to those strange folks who want to build an earthquake-proof facility. If the earthquake hits your location it will almost certainly sever the fibre and, without connectivity, a data centre is reduced to a secure depository for second-hand ICT kit…
