A Perfect Electrical Storm?
The interaction of various components, performance characteristics and fault events may force us to reconsider electrical protection in the data centre. The result suggests that fuses, long considered old-fashioned in data centre power distribution, may yet have a future in the pursuit of uptime.
In the simplest scenario we could have one critical load connected to one power supply. If that load suffers an internal short-circuit then the electrical protection system disconnects the entire load (e.g. in 0.1s) to limit physical damage and danger to human life. The problem we are considering here is different: several critical loads are connected to one power distribution point (e.g. a power-strip), only one of them suffers a short-circuit, and we want to disconnect it while leaving all the healthy loads in service. To do that we need to review:
For how long can the load ride through a loss of voltage?
What happens to voltage during a short-circuit?
How is the load disconnected and how quickly?
Are there safety implications of fast, or slow, disconnection?
The answers to these questions are important to the ‘survival’ of the remaining healthy loads. In this context the term ‘survival’ means that the function of the microprocessor is not disrupted in any way and each critical load has the ability to ride-through voltage events using its on-board energy storage.
Ride-through capability of ICT hardware at zero volts
The first part of our story concerns the ride-through ability of a microprocessor during a loss of supply voltage. It began in 1978, when the IEEE Industrial & Commercial Power Systems Conference introduced the first version of the CBEMA Curve (Computer and Business Equipment Manufacturers Association). This illustrated the voltage and time window for ‘acceptable’ operation, and the key piece of information was the ride-through time at zero volts of 10ms. Nothing much happened for five years, until Federal Information Processing Standard 94 (FIPS 94) included the CBEMA curve. Then the IEEE Working Group developed IEEE Std.1100 – 1992, ‘Powering & Grounding Sensitive Electronic Equipment’. Although this was an ANSI standard, it was still the only reference point for the whole world, and it continued to be based on 120V/60Hz with 10ms at zero volts.
In 1996 the Power Electronics Application Center of the Electric Power Research Institute announced the results of work by the ESC-3 WG of the Information Technology Industry Council (ITIC, formerly CBEMA) on testing the susceptibility of personal computers. The result was a revision in 1999 of the ITIC/CBEMA Curve in IEEE Std.1100. This revision increased the zero-voltage immunity from 10ms to 20ms. Again, in Europe, we embraced the 20ms without worrying about the V/Hz details.
By 2007 many pundits were openly questioning the 1999 ITIC curve and suggesting that the true ride-through of servers was much longer than 20ms. However, when the ‘change’ came in 2013, one server OEM let slip that their current range of servers would not meet the 20ms specification if fully loaded but would still meet the 10ms limit. They also said that when not fully loaded the servers could achieve more. The ‘reason’ given was that, in pursuit of energy efficiency, input capacitors had shrunk in size, affecting energy storage and losses. The announcement of this retrograde step in ride-through capability by one OEM did not produce the anticipated reaction from competing server vendors or the wider community, since they were all in the same boat in pursuit of Energy Star ratings.
The latest chapter in this ‘ride-through’ story is now being written by The Green Grid and their Power Working Group, with the preliminary results being varied and surprising. Whilst we will have to wait for the full picture, the trend for some hardware is not ‘longer’ but ‘shorter’ and not entirely related to load, with a zero-voltage ride-through of only 6ms being a possibility in individual cases. So, for our purposes here, we will assume that the latest ride-through is less than 10ms and more than 6ms at 100% power supply load.
Short-circuit effects on the power distribution system
Servers utilise single-phase switched-mode power supplies, and so the most usual short-circuit event is a single-phase-to-ground/earth fault. The ground/earth conductor provides a very low resistance path between the voltage source and the ground/earth point. The instantaneous response (<5ms) to the short-circuit is that the current rises dramatically and the voltage collapses close to zero.
How high the current rises depends upon the impedance of the source and the impedance of the circuit from the source, through the arc resistance and back to the ground/earth connection. Feeding energy into this ground/earth fault path is the source, whose peak current under a short-circuit condition at its output terminal connections is dependent upon its ‘impedance’ or ‘sub-transient reactance’. A typical commercial distribution transformer will have an impedance of 6-10% and, since the fault current multiple is the reciprocal of the per-unit impedance, will be capable of producing roughly 17 to 10 times full-load current for the first few milliseconds, during which time the voltage will have collapsed. The closer the load is to the energy source the higher will be the short-circuit current.
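The link between the 6-10% impedance figures and the fault current multiples quoted above can be checked with a few lines of Python. This is only a sketch: it assumes a bolted fault at the transformer terminals and an infinite upstream source, so that the transformer's own impedance is the only limit on current, and the function name is hypothetical.

```python
def fault_current_multiple(impedance_percent: float) -> float:
    """Prospective short-circuit current as a multiple of full-load current.

    Assumption: bolted fault at the transformer terminals with an infinite
    upstream source, so only the transformer impedance limits the current.
    """
    if impedance_percent <= 0:
        raise ValueError("impedance must be positive")
    return 100.0 / impedance_percent


for z in (6.0, 10.0):
    multiple = fault_current_multiple(z)
    print(f"{z:.0f}% impedance -> {multiple:.1f} x full-load current")
```

A 6% transformer gives about 16.7 times full-load current and a 10% transformer gives 10 times. Any cable and arc impedance between the transformer and the fault would reduce these values, which is why the fault current falls as the load moves further from the source.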
So, do we need high or low fault current? That depends upon the time in which we want to disconnect the load. In commercial systems, the designer usually chooses to set the protection devices to disconnect the load in 0.1s and that is possible with 3 to 5 times rated current with an automatic circuit breaker set to ‘instantaneous’.
Speed of disconnection
Our data centre problem is simple and yet almost unique in electrical services design. When a short-circuit happens the voltage collapses and the other (healthy) loads can only continue to operate for less than 10ms.
For me, this part of our tale began in the very early 1990s when working with rotary UPS. They have a low impedance (typically delivering 13-20x rated current into a fault) and, unlike static UPS, can rupture large fuses without recourse to the utility. It therefore became usual for all rotary UPS OEMs to make a show of fuse-blowing at works witness tests. But why fuses? Who uses fuses in data centres and why not use circuit breakers? Well, fuses are much faster.
Around 1992 tests were carried out at the Falcon Laboratory in Loughborough UK comparing 3x400A fuses with a 400A Moulded Case Circuit Breaker (MCCB) at 50kA of short-circuit current. 400A is a large fuse/breaker to disconnect but is representative of a floor-mounted PDU supporting 250kW of load. The test measured the total clearance time of the devices, from the instant the short-circuit current commenced to when the load was disconnected (arc quenched) and the voltage was restored to normal. The result was not surprising given the simplicity of the fuse compared to the current sensing time and electro-mechanical time constant of the MCCB:
Fuses cleared in a total time of 8ms
MCCB set to ‘instantaneous’ trip cleared in 17ms
Clearly 17ms is far too long for the voltage to remain collapsed if we are trying to restore it in less than 10ms. It is also clear that protecting the healthy loads from a single server failure is best achieved by protecting each server with a fuse that is as physically close to the load as possible and has as low an Amp rating as possible.
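The comparison between clearance time and ride-through can be laid out as a small Python sketch. It uses only the figures quoted in this article, treats the supply as sitting at zero volts for the whole clearance time (a simplification), and the names are hypothetical.

```python
# Total clearance times from the test quoted in the text (ms, 400A devices).
CLEARANCE_MS = {"fuse": 8.0, "MCCB (instantaneous)": 17.0}

# Zero-voltage ride-through values discussed in the article (ms).
RIDE_THROUGH_MS = {
    "1999 ITIC curve": 20.0,
    "original 10ms limit": 10.0,
    "possible worst case": 6.0,
}


def survives(clearance_ms: float, ride_through_ms: float) -> bool:
    """True if voltage returns before the healthy load's stored energy runs out.

    Simplification: voltage is assumed to be zero for the full clearance time.
    """
    return clearance_ms < ride_through_ms


for device, t_clear in CLEARANCE_MS.items():
    for label, t_ride in RIDE_THROUGH_MS.items():
        verdict = "rides through" if survives(t_clear, t_ride) else "drops"
        print(f"{device} ({t_clear:.0f}ms) vs {label} ({t_ride:.0f}ms): {verdict}")
```

On these figures the MCCB only works against the 20ms curve. The 400A fuse meets a 10ms window but not a 6ms one.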
Safety related impacts – arc-flash
In the pursuit of fast disconnection, we can see that minimising the resistance of the power system distribution path maximises the prospective fault current. In a growing number of markets ‘arc-flash’ has become a recognised safety concern, and the level of fault current drives the severity of the incident.
Arc-flash is a high-energy event initiated by a short-circuit fault. The short-circuit current ionises the air between two conductors and the effect is akin to an explosion with gases, heat, light, shock-wave, shrapnel and vaporised copper. The injuries sustained in an arc-flash event include severe burns, temporary blindness, hearing loss, lung damage and barotrauma. Arc-flash PPE is only capable of protecting the person to ‘just survivable 2nd degree burns’, so in no way must the use of PPE be regarded as a first step in a safe working practice, rather a last step.
The arc-flash risk is measured as incident energy in cal/cm² at a calculated boundary, and the disconnection time is key to the result. For example, take an LV switchboard with a peak fault current of 65kA that suffers a short-circuit in a switch, with a technician 450mm away from the root of the short-circuit. Calculating the result for three disconnection times (covering fuse and circuit breaker) shows the advantage of speed:
0.19s (a ‘MIN’ setting in a circuit breaker) results in a boundary of 3,100mm and energy of 20 cal/cm². This would require Class 3 PPE, which includes cotton underwear, flame resistant shirt, trousers and overalls, hood & mask and gloves.
0.05s results in a boundary of 1,075mm and energy of 5.3 cal/cm² and would require Class 2 PPE.
0.01s (easily achieved by a fuse) results in an arc-flash boundary of 360mm and incident energy of 1.1 cal/cm², requiring Class 1 PPE.
It is obvious that fuse protection reduces the severity of arc-flash.
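The three incident-energy figures quoted above suggest that, at a fixed fault current and working distance, incident energy scales roughly in proportion to disconnection time. The Python sketch below simply divides each quoted energy by its quoted time to show this. It is an illustration using the article's numbers, not an arc-flash calculation method, and the function name is hypothetical.

```python
# (disconnection time in s, incident energy in cal/cm^2) as quoted in the text,
# all for the same 65kA fault at a 450mm working distance.
CASES = [(0.19, 20.0), (0.05, 5.3), (0.01, 1.1)]


def energy_rate(time_s: float, energy_cal_cm2: float) -> float:
    """Incident energy delivered per second of arcing (cal/cm^2 per s)."""
    return energy_cal_cm2 / time_s


rates = [energy_rate(t, e) for t, e in CASES]
for (t, e), rate in zip(CASES, rates):
    print(f"{t:.2f}s -> {e:.1f} cal/cm^2 ({rate:.0f} cal/cm^2 per second)")
```

The rate stays between roughly 105 and 110 cal/cm² per second across all three cases, so each millisecond removed from the clearance time takes about 0.1 cal/cm² off the exposure in this example.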
The zero-voltage ride-through capability of ICT hardware is falling below 10ms. If we want to achieve fast disconnection during an individual load short-circuit then we need to maximise the fault current at the individual protective device and that will be best provisioned by a fuse rather than an automatic circuit breaker. The application of fuses will also reduce the severity of arc-flash. The change to fuse protection would be a major event in data centre power system design.