70% of Failures Due to Human Error? It Should Be 100%...
There have been many data centre ‘failure’ studies over the years that attribute a high proportion of failures to ‘human error’, the most widely accepted coming from the Uptime Institute membership, with a reported 70%. I personally prefer the (unpublished) version from Microsoft in North America, which added human and software errors together and suggested that 97% of all the failures across its 25+ data centres were down to human error. So, the question arises: should we be aiming for more, or less, than 70%?
If we ignore for a moment which constituents make up human error, the answer must be that human error should, in a perfect data centre power and cooling infrastructure, account for 100% of failures, because then the designer has created perfection: concurrently maintainable and fault tolerant. It is only then that the dangers appear and, to paraphrase Lee Iacocca, ‘risks (costs) appear in my business on two feet’.
Having said that, many ‘data centre failures’ that we read about in the press are not related to the data centre at all; they are software system failures, very often occurring during system upgrades. The well-publicized NatWest ‘data centre’ failures of a couple of years ago were entirely related to software upgrades, yet they shut down access to accounts and the ATM network for days, and the ‘data centre’ was blamed.
So, what constitutes human error? Probably everything that touches the data centre process, from finance through design, construction, testing and operations. Should the infrastructure be blamed for a power failure, or the person who cut the budget and prevented enough redundancy from being installed to meet the business case? What about the failure that occurs when an operator pushes the wrong button because ‘someone’ didn’t budget for training or allow the staff sufficient opportunity to practice ‘live’? In nearly all ‘failure’ cases, all roads lead to Rome: human error.
Some failures, when reported, exhibit an ignorance (real or feigned) of the realities of data centre engineering, producing a smoke screen for the data centre operator to hide behind. The latest example came only last month, when a power utility failure was blamed for a data centre losing power to its critical load. After the ‘we apologize to our clients for these external and unexpected events’ message, the proposed solution was ‘to correct inadequate investment in the past and install a second utility feed’. At least a nod to the lack of investment being to blame, although not in the right place! Utility failure (or significant transient deviation) is a normal and 100% expectable event in every data centre, so adding another connection won’t help in any way whatsoever. No, this data centre problem was clearly related to the emergency diesel system failing to back up the utility. So, why not blame that? I can’t say with certainty but, over many years, I have seen numerous EPG ‘failures’ that are actually ‘human error’ related, such as:
Lack of maintenance to starter battery and charger
Lack of care of the fuel quality/contamination
Not switching the charger back ‘on’, or the system back into ‘auto’, after maintenance
Lack of monthly system testing, starting on load
Lack of emergency testing of generator switchgear (not the single sets)
You can see that these ‘generator failures’ all start to be ‘human’ related.
So how can we reduce human error? We can certainly design out the opportunity for operational error, albeit usually at higher capital expenditure, for example with a fault-tolerant power and cooling system, although that must have a fault-tolerant control system to match, something usually missing from so-called Tier IV facilities (apologies to Uptime for the abuse by use of ‘Tier’ and Roman numerals). We can also reduce human error through best-in-class operations: training, re-training, up-to-date documentation, and SOPs/EOPs backed up by regular live testing and emergency simulations. Then, at least, most failures will occur at a known date and time, when everyone can be fully prepared for a brief outage, rather than at a random instant that may impact the business just when it is most vulnerable.
But, ignoring software problems, how can we reduce data centre system failures, human errors, or combinations of the two? In my opinion, herein lies the greatest opportunity for improvement, and a new venture spearheaded by Ed Ansett (known to many from his EYPMCF days in Singapore and London and, now, i3): the sharing of operational experiences and, to be clear, the sharing of data centre failures in detailed facts rather than marketing spin, for the common good, so that each can learn from the others.
This venture is a not-for-profit organization called DCiRN, the Data Centre Incident Reporting Network. It is currently free to join, although at some point it will have to charge a small annual fee to cover administration costs. You can find the joining instructions at www.dcirn.org.
But how will it work? The inspiration comes from the airline and maritime industries, which have a strong record of continuously improving passenger safety by sharing accident and near-miss information through an anonymized system called CHIRP; yes, you could call it a whistle-blowing system. No equivalent exists in the data centre industry, where it is common practice to cover up failures or potential failure incidents in a misguided attempt to protect reputations. Root-cause investigation findings are normally secret and bound by NDA, which has prevented the industry from learning from its failures. Whilst CHIRP is aimed at human safety, data centres support every aspect of the digital economy and, as we become more reliant on them, for example with self-driving cars, it is only a matter of time before a failure is associated with human fatalities. Hence DCiRN, and we need to act sooner rather than later; there is no reason why our archaic secrecy should continue. Working with globally recognized industry leaders as advisors and editors of the confidential reports that are submitted (not involving equipment OEMs in any way), DCiRN is a forum for the exchange of information between data centre operators around the world, encouraging the confidential sharing of information about data centre failures so that lessons can be learnt and failure rates reduced.
The system is simple. Anyone can download the form from the website and submit a report. It will be analyzed and anonymized by the Advisory Board and then, ONLY if the anonymity of the reporter and the location can be 100% guaranteed, the incident will be published and circulated, free of charge, to all members.
Will it work for all incidents? Probably not. Will it prove of value to all members, through education and the prevention of repeat errors? Definitely yes. I, for one, will certainly support the principle and the process, and will encourage all my clients to use the system.