As You Design for a Low PUE Do You Sacrifice Availability?
A few weeks ago I was pleased to be asked to speak at the opening ceremony of the new EVRY data centre facility in Fetsund, near Oslo, complete with lots of razzmatazz, hosted by a TV chat-show star and even with a government minister rolled out especially for the event. As I sat waiting for my slot on the agenda I listened to the opening address and a presentation by the developer, and a phrase caught my ear that made me adjust what I was going to say when my turn came. The phrase was ‘no compromise’, and it was aimed at the facility design.
Now, my talk was centred on the reasons for data centre power growth and the reaction of governments, especially the EU, to what they see as a growing threat to their carbon-reduction initiatives. So, being in Norway, I was going to talk about all the advantages of locating in a place where zero-carbon renewable power is abundant, the climate is cool (not to say cold!), the air is clean and there is a pool of well-educated potential employees. I also included a justification, or explanation (but not an excuse), of why the colour and cost of energy is still not, and may never be, the main driver in choosing a data centre location, but that is another subject for another day. When I reached the low-PUE point in my presentation I picked up on the ‘no compromise’ theme and made the statement that achieving a lower PUE would be no problem...
So, ‘no compromise’ power & cooling? Low PUE? High Availability? All together at the same time? There are a few notable exceptions that pursue low PUE almost at the expense of Availability (or solve the riddle in a different way), but in more than 99% of instances the user wants high Availability above anything else, often despite not having the budget to pay for it!
What occurred to me as I sat listening was that I was about to get up and boast about a PUE that would certainly be less than Google’s 1.12 once the load rose above the first few hundred kW. And yet, if low PUE were the primary target, I should have been boasting about a figure similar to eBay’s remarkable 1.07. That is because such a number is not hard to achieve, but it has consequences that some users would shy away from: a higher risk of mission failure and lower Availability.
So, why will the EVRY Fetsund facility have a PUE nearer to 1.12 than 1.07? Let’s see. In the following sections the term pPUE (‘p’ for ‘partial’) describes one sub-system’s contribution to the overall annualised PUE; a full description is given in the international standard ISO/IEC 30134 ‘KPIs for Resource Effective Data Centres’, where Part 2 (30134-2) covers PUE.
Power
Until the advent of transformer-less modular UPS, where the machine itself adjusts the number of active modules to match the applied load and so keeps standing losses low, the answer to the title question was ‘yes’: high levels of redundancy reduced the load on each UPS module, partial load made matters worse, and low efficiency was endemic. A high component count is one drawback of modular UPS so, in theory at least, the efficiency gained from modularity under partial load comes with an impact on Availability.
Consider a legacy monolithic transformer UPS whose full-load efficiency is 92%, whose 40%-load efficiency is 89% and whose 20%-load efficiency is often little better than 85%. For just the power system from UPS input to load, the typical contribution to the facility pPUE was 1.10 at full load and higher than 1.20 at partial load. In the worst case of a 2(N+1) dual-bus system (akin to the ‘original’ Tier IV requirements of Uptime Institute) a UPS module spent its entire service life at well below 20% load, so the gap between the most efficient (but least reliable) ‘N’ solution at a pPUE of 1.10 and the least efficient yet most reliable 2(N+1) solution at more than 1.25 exceeded 0.15, with low ICT load making the difference greater still.
Today’s best practice UPS closes the gap, but a small gap remains nonetheless:
Consider a modular transformer-less UPS whose VFI (series on-line) N+1 full-load efficiency is 96.5%, whose 40%-load efficiency is 97.0% and whose 20%-load efficiency is 96%. For just the power system from UPS input to load, the contribution to the facility pPUE is 1.04 at full load and 1.05 at partial load.
In the worst case of a 2(N+1) dual-bus system (again, akin to the ‘original’ Tier IV requirements of Uptime Institute) a UPS module spends its entire service life at below 20% load, so the difference between the most efficient yet least reliable ‘N’ solution (pPUE 1.02) and the least efficient yet most reliable 2(N+1) solution (pPUE 1.04) is a very small 0.02, with low ICT load hardly making an impact.
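To make the arithmetic explicit: the pPUE contribution of the power path is roughly the reciprocal of the UPS efficiency at its operating point. Here is a minimal sketch using the efficiency figures quoted above (the quoted pPUEs also include downstream distribution losses, which I ignore, so these results come out slightly lower):

```python
# Illustrative only: pPUE contribution of the UPS path, approximated as
# 1/efficiency. Efficiency figures are those quoted in the text; distribution
# losses between UPS and load are ignored.

def ups_ppue(efficiency: float) -> float:
    """Approximate pPUE contribution of a UPS running at a given efficiency."""
    return 1.0 / efficiency

legacy = {"100% load": 0.92, "40% load": 0.89, "20% load": 0.85}
modular = {"100% load": 0.965, "40% load": 0.970, "20% load": 0.96}

for name, curve in (("legacy monolithic", legacy), ("modular transformer-less", modular)):
    for point, eff in curve.items():
        print(f"{name:24s} at {point}: pPUE ~ {ups_ppue(eff):.3f}")
```

The 2(N+1) penalty is simply each curve read at its 20%-load point: about 1.18 (plus distribution losses) for the legacy machine against about 1.04 for the modular one.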
Now, some design engineers would say that modular transformer-less UPS is less reliable (higher component count, etc.) than a monolithic array of larger modules, and some (especially in North America) even argue that a transformer solution is superior. I, along with most in Europe, would debate this, but nevertheless a small impact on reliability could be argued as the price of the improvement from 0.10 to 0.02. Bigger than that is the gap between on-line UPS (voltage and frequency independent, VFI) and line-interactive (voltage independent, VI) at 98% efficiency and, further, eco-mode/stand-by (voltage and frequency dependent, VFD) at 98.7%.
Yet more important is the application of dual-bus, A & B, power feeding dual-cord loads. The UPS systems must then operate with never more than 50% of the load each, which in the real world often means less than 20% load, so there is a negative impact on efficiency. But the operational and reliability enhancements are huge: close to a 10x improvement in reliability and a large reduction in the chance of human error which, according to the Uptime Institute, accounts for more than 70% of all data centre service failures.
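To see why 2(N+1) modules spend their lives so lightly loaded, consider that each bus carries at most half the ICT load, spread across N+1 modules sized so that N alone could carry it. A quick illustration, in which the module count and utilisation are my assumptions:

```python
# Why 2(N+1) modules idle so low: each bus sees at most 50% of the ICT load,
# spread across N+1 modules sized so that N alone could carry the bus.
# N and the utilisation figure below are assumed for illustration.

N = 3                # modules needed per bus at full design load (assumed)
utilisation = 0.5    # facility running at half its design ICT load (assumed)

# Per-module load as a fraction of the module rating:
module_load = utilisation * 0.5 * N / (N + 1)
print(f"Each module runs at ~{module_load:.0%} of its rating")  # ~19%
```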
Therefore we ‘could’ draw a progression from legacy monolithic transformer-based UPS with a pPUE contribution of 1.20 down to best-in-class modular transformer-free UPS in eco-mode at 1.01: a substantial improvement of 0.19, gained along a path of increasing ‘risk’, be that ‘real’ or ‘perceived’.
Of course, at EVRY Fetsund it was never envisaged to apply legacy UPS systems, but the result of the ‘no compromise’ design brief was to install dual-bus 2(N+1) modular transformer-less UPS capable of a pPUE contribution of 1.05, instead of a single-bus N arrangement offering 1.03, with the added advantage that the chances of human error are drastically reduced. But the ‘no compromise’ doesn’t end there: it was decided not to use line-interactive (VI) UPS, which is 1.5% more efficient, nor to enable the eco-mode feature which, by utilising the mains power directly whenever it is within voltage quality limits, raises the module efficiency by 3.5%.
Eco-mode is an interesting debate. All high-end static UPS have the feature as standard, but it can be disabled by the user. The principle is simple: when the utility voltage is within acceptable limits (inside the 10ms zero-volt ride-through capability of the ICT load) the UPS automatically switches itself into bypass. When the utility varies (in voltage level or frequency) the UPS automatically reverts to on-line VFI operation, with a worst-case ‘break’ of 4ms (one OEM has a patented 2ms limit, which will no doubt soon be emulated by all the others). This 4ms is no different from the normal bypass operation for fault or maintenance, but the concept is, so far, not often applied. At typical European electrical energy costs the increase in efficiency from 96.5% to 99% pays back 100% of the UPS capital cost in under 3 years and reduces the potential PUE by 0.035.
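As a rough illustration of that payback claim, here is a sketch in which the ICT load, tariff and UPS capital cost are my assumptions, not EVRY figures; only the two efficiencies come from the text:

```python
# Illustrative eco-mode payback. All inputs except the two efficiencies
# are assumptions for the sake of the example.

it_load_kw = 1000                # assumed ICT load
tariff_eur_per_kwh = 0.10        # assumed energy cost
hours_per_year = 8760
ups_capex_eur = 60_000           # assumed UPS capital cost

eff_vfi, eff_eco = 0.965, 0.99   # efficiencies quoted in the text

def annual_loss_kwh(eff: float) -> float:
    """Energy dissipated in the UPS per year at a given efficiency."""
    return it_load_kw * (1 / eff - 1) * hours_per_year

saving = (annual_loss_kwh(eff_vfi) - annual_loss_kwh(eff_eco)) * tariff_eur_per_kwh
print(f"Annual saving: ~EUR {saving:,.0f}")             # ~EUR 23,000
print(f"Payback: ~{ups_capex_eur / saving:.1f} years")  # under 3 years
```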
The ‘risk’ from enabling eco-mode is ‘real’, but the ‘perception of risk’ is higher: every time a deviation in the utility occurs the UPS has to sense, decide, activate and successfully complete the transfer, both ‘on’ and ‘off’ eco-mode, so every event is a chance of failure. In a location with low utility quality (e.g. a Caribbean island) those chances can arise several times per day, compared with a location like Norway where they may arise once per year. Enabling eco-mode is, in my opinion, a huge cost-saving opportunity, especially on a proven high-availability hydro-dominated utility like Norway’s, with a very (very) low impact on risk. At EVRY the risk would be further minimised by the separate A & B systems, the chance of simultaneous operational failure of both being as close to zero as is possible.
So, to summarise the power considerations: EVRY could have specified a system capable of a pPUE as low as 1.01 (single-bus N, transformer-less modular, eco-mode enabled, not concurrently maintainable and open to human error) instead of a pPUE of 1.05 (dual-bus 2(N+1), no eco-mode, concurrently maintainable and fault-tolerant), a contribution of 0.04 to the ‘no compromise’ design philosophy.
Cooling
Usually the cooling system is the first place to look for data centre energy savings. The key to that saving is ‘simply’ to contain the airflow and run the facility as hot as possible, maximising the free-cooling opportunity offered by the location’s climate.
The first problem that carries ‘risk’ is the setting of the server inlet temperature and humidity. The legacy specification of 21°C±1° and 45-55%RH (which dates from the mid-1950s IBM mainframe requirements for tape heads and punch cards) is still used today in facilities where taking any ‘risk’ is not acceptable. Despite the latest limits (both Recommended and Allowable, embodied in the 2011 Thermal Guidelines of ASHRAE) being much wider, many users still cling to the legacy specification, with the result that chilled water temperatures are so low (e.g. 6°C flow) as to permit only a minimum of free-cooling, the mode in which compressor-driven refrigeration is not used.
Therefore, even in Oslo, the pPUE for a chilled-water cooling system with legacy set-points, no aisle containment and no free-cooling coils would be in the order of 1.5 at full load and considerably more at partial load. Such designs are not uncommon in facilities more than 5 years old, but many (though by no means the majority of) modern facilities are based on the more enlightened, wider ASHRAE ‘Recommended’ limits of 18-27°C and 20-80%RH. Using this specification in Oslo, with high chilled water temperatures and free-cooling coils, it is possible to reach a pPUE of 1.25 at full load and with redundancy, although partial-load pPUE can still be a design challenge.
At the EVRY facility they have chosen a solution where the server inlet is 25°C, below the ‘Recommended’ upper limit of 27°C, with humidity controlled to 40-60% and tight aisle containment. This is fed by a compressor-less indirect cooling system based on evaporative cooling in the external heat-rejection circuit. The most important feature of this solution, apart from the pPUE being 1.08 at full load and lower, around 1.04, at partial load, is that outside air is not introduced into the critical space. There was an opportunity to save cooling energy by using a direct air system (letting outside air into the cold aisles after minimal filtration etc.), but that would have been a compromise, risking corrosion and contamination of the ICT hardware for only a 0.015 advantage in pPUE: too high a risk to the load for such a minimal benefit. It is worth noting that some very large facilities in Scandinavia do use fresh-air cooling (e.g. Google), but their hardware refresh cycle is so short that corrosion and contamination cannot gain a foothold in time.
So, to summarise the cooling pPUE, we can see a line stretching from ‘no risk, legacy, bad practice’ at 1.5-1.8, through best-practice chilled water taking some risks (by the legacy benchmark) at 1.25-1.40, to the modern evaporative/adiabatic indirect air system as fitted at EVRY at 1.04-1.08, which (in Oslo, if not in other EMEA regions) takes no risks at all, and finally to direct fresh air, where a pPUE of 1.025-1.04 can be achieved by taking contamination risks with the ICT hardware.
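Since PUE is annualised, the full-load and partial-load figures have to be weighted by how the facility actually runs over the year. A minimal sketch, in which the load profile is my assumption rather than EVRY data:

```python
# Illustrative annualised cooling pPUE as an energy-weighted average.
# The load profile is assumed; the pPUE values are those quoted in the text.

# (fraction of the year, ICT load fraction, cooling pPUE at that load)
profile = [
    (0.25, 1.00, 1.08),  # periods at full load
    (0.75, 0.40, 1.04),  # periods at partial load
]

ict_energy = sum(t * load for t, load, _ in profile)
cooling_energy = sum(t * load * (ppue - 1) for t, load, ppue in profile)
print(f"Annualised cooling pPUE ~ {1 + cooling_energy / ict_energy:.3f}")  # ~1.06
```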
Strictly speaking the PUE should also include the embedded energy of the city water used, and a proxy kWh/litre figure is given in the ISO standard. This matters in areas where a large quantity of water is consumed by the evaporative/adiabatic cooling system, for example where the dry-bulb temperature is high enough during the summer months to need evaporation to take advantage of the wet-bulb temperature. Consumption in some direct systems can be a not inconsequential 1000m³/MW-cooling/annum, but in the Oslo area the climate minimises that period. The indirect system at EVRY uses less water and, in addition, the facility harvests rainwater (only using the city-water connection for redundancy), so the energy used for pumping is included in the facility PUE.
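To give a feel for the scale, here is a sketch; note that the kWh/litre proxy below is a placeholder of mine, not the value defined in ISO/IEC 30134:

```python
# Illustrative scale of water's embedded energy. The 1000 m3/MW/annum figure
# comes from the text; the kWh-per-litre proxy is an assumed placeholder,
# NOT the value defined in ISO/IEC 30134.

water_litres = 1000 * 1000     # 1000 m3 per MW of cooling per annum
proxy_kwh_per_litre = 0.001    # assumed embedded-energy proxy
it_energy_kwh = 1000 * 8760    # 1 MW of ICT load for a year

water_energy_kwh = water_litres * proxy_kwh_per_litre
print(f"Embedded water energy: {water_energy_kwh:,.0f} kWh/year")
print(f"Contribution to PUE: ~{water_energy_kwh / it_energy_kwh:.5f}")
```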
There is another aspect of pursuing a low cooling pPUE by choosing an indirect air-based system that cannot be ignored, as it impacts capital cost. Chilled water has about 4000x the heat-carrying capacity of air at the temperature bands we are considering and, as a consequence, the physical space (particularly the floor-to-ceiling height) has to be purpose-designed to accommodate the air-cooling heat exchangers. This means higher construction costs and usually precludes the use of existing buildings; more Capex is required in the construction phase to enable lower Opex in the operational phase.
A ‘good read’ on a winter’s night is the 2011 ASHRAE Thermal Guidelines... The table of Recommended and Allowable temperature and humidity limits is on one page, but the other 40+ pages tell you in great detail the downsides of pushing the envelopes to their limits.
Fire Suppression
Normally fire suppression consumes no energy and does not impact the PUE at all. However, the losses of data centre services that occur when fire suppression is inadvertently released (false activation) are well documented. To avoid such interruptions, and to avoid gas replenishment charges, the EVRY facility uses a system of oxygen reduction in the critical space. Reducing the oxygen level from 21% to 16% ensures that combustion cannot start. However, this system absorbs about 10kW of power per 1400kW equivalent of ICT space and so contributes 0.007 to the facility PUE. This expenditure increases availability, reduces the chance of contamination from the chemicals in the usual suppressant gases, eliminates shock damage to hard drives from the discharge noise of gaseous suppressants, avoids all water in the critical space and cannot be falsely activated, since it is active all the time. This is the final example of ‘no compromise’ at the expense of the overall facility PUE.
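The arithmetic behind that 0.007 is simply the auxiliary power divided by the ICT load it protects:

```python
# Oxygen-reduction system: PUE contribution, using the figures from the text.
aux_power_kw = 10     # power absorbed by the oxygen-reduction system
ict_load_kw = 1400    # equivalent ICT load protected

print(f"PUE contribution: {aux_power_kw / ict_load_kw:.3f}")  # ~0.007
```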
Summary
If we now add the contributions from the power, cooling and fire-suppression systems together, we can see that IF low PUE were the only target then EVRY could have achieved an annualised PUE of 1.07 by taking risks with single-bus eco-mode power and direct fresh-air cooling, or could have built a legacy chilled-water facility with a PUE of 1.70. In fact, without taking risks with the load hardware, and building a concurrently maintainable facility with 2(N+1) power, they will achieve a PUE of 1.12. Simply enabling eco-mode would see that fall to 1.09.
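Putting the itemised contributions together reproduces those headline figures; the 0.06 annualised cooling contribution and the small ‘everything else’ allowance (lighting, distribution and so on, not itemised in this article) are my assumptions:

```python
# Rough reconciliation of the summary figures. The power, fire and eco-mode
# numbers are from the text; the annualised cooling (0.06) and 'other' (0.005)
# allowances are assumptions to bridge items not itemised above.

power, cooling, fire, other = 0.05, 0.06, 0.007, 0.005
as_built = 1 + power + cooling + fire + other
print(f"As built:      PUE ~ {as_built:.2f}")          # ~1.12
print(f"With eco-mode: PUE ~ {as_built - 0.035:.2f}")  # ~1.09
```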
No compromise? Yes; energy effectiveness, low PUE and high availability in combination.