Liquid Cooling – a future for small data centres?
A Nordic-specific introduction
The following reader on the current options for ‘liquid cooling’ in data centres is based on the main data centre locations of London, Frankfurt, Amsterdam, Paris and Madrid, but its conclusions are interesting when viewed from the perspective of data centres in the Nordics, as ‘district heating’ is a common feature of many Scandinavian cities. What is considered ‘unlikely’ for some time to come in, say, London is ‘more than possible’ in Nordic locations, where the cool climate demands heating and district heating schemes have already been in place for decades. Indeed, there are data centres in the region that already push a high proportion of their waste heat into the local heating loop. At the end of the paper is a view of Smart City engineering that could suit the Nordics particularly well.
A review of liquid cooling in its various forms
Liquid cooling is currently available in three formats, all of which are in the early stages of market trials within the data centre space.
On-chip cooling
Immersion cooling
Encapsulation cooling
Before we explore each technology we need to note the following principles:
So far, no ICT hardware is built specifically for liquid cooling: liquid cooling adapts air-cooled products. This is important when we consider one of the main marketing claims of liquid cooling, namely higher power density per cabinet. Any server adapted for liquid cooling starts from an air-cooled design, so the density is limited only by the number of servers (for example) that can be physically loaded into a cabinet – a limit that applies equally to air cooling. Air-contained cold aisles can support any capacity of rack power as long as enough air can be presented to match the (variable) demand of the servers themselves.
One advantage of encapsulation and immersion solutions is that the server power drops slightly as the on-board cooling fans are removed, although this is usually nearer 6-8% at full utilisation than the 15-25% claimed by some liquid cooling solution providers. Fan power only approaches 25% of server power when the server inlet temperature reaches 30-32⁰C.
At this stage of development ‘blade’ solutions are not possible for on-chip cooling or encapsulation cooling and are problematic for immersion cooling. Having said that, ‘blades’ are not dominating the market as expected and the ‘pizza-box’ commodity server is still alive and well.
Most of the energy-saving comparisons for liquid cooling are made against legacy chilled water applications and are, in fact, marginal when compared to modern chilled water (pPUE = 1.12-1.14) and modern DX (pPUE = 1.15-1.20), whilst actually being negative when compared to indirect evaporative air (pPUE = 1.03-1.05).
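To put these numbers into context, the short Python sketch below takes a nominal 1MW air-cooled IT load, credits a liquid-cooled alternative with the 6-8% fan saving, and works out the break-even pPUE that the liquid solution (including its pumps and heat rejection) would have to beat for each of the modern air-based options quoted above. The 1MW load and the mid-range pPUE figures are illustrative assumptions, not measurements.

```python
# Break-even pPUE for liquid cooling versus modern air-based cooling.
# All input figures are illustrative assumptions based on the ranges quoted above.

IT_LOAD_KW = 1000            # assumed 1 MW air-cooled IT load
FAN_SAVING = 0.07            # 6-8% of server power removed with the on-board fans

# Mid-range partial PUE (pPUE) figures for modern air-based cooling
air_cooling_ppue = {
    "modern chilled water":     1.13,
    "modern DX":                1.17,
    "indirect evaporative air": 1.04,
}

liquid_it_kw = IT_LOAD_KW * (1 - FAN_SAVING)   # IT load once the fans are removed

for name, ppue in air_cooling_ppue.items():
    air_total_kw = IT_LOAD_KW * ppue           # IT plus cooling, air-cooled case
    breakeven_ppue = air_total_kw / liquid_it_kw
    print(f"{name:26s}: liquid pPUE must be below {breakeven_ppue:.2f} to save energy")
```

Against indirect evaporative air the break-even figure is only around 1.12, and that has to include the liquid circuit pumps and the heat rejection to ambient – which is why the comparison above is described as negative.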
On-chip cooling
This technology was developed and popularised primarily for accelerating video processor speed in desktops used for high-end video gaming and for CGI applications in the film industry. Removing processor heat directly at the chip surface enables the clock-rate (the speed of switching) to be increased above the rate the chip was thermally designed for. All of the other heat-generating elements in the server (such as switched-mode power supplies, memory and hard-drives) still need cooling by the server fans – and hence the need for ‘conventional’ data centre thermal management remains in place. The design intent has never been to export the high-grade heat other than to an air:air or liquid:air heat-exchanger directly adjacent to the motherboard. The increased speed does generate more heat, so we could argue that density is increased, but it is in the order of 50W extra output on a 300W motherboard. The liquid used is usually refrigerant-based and is either circulated by convection in some form of heat-pipe or actually pumped.
The benefits of the technology are limited to increased processing speed, particularly for graphics cards, and a minor reduction in on-board fan power. For details of a typical solution provider go see www.microway.com, but it has to be pointed out that (in conjunction with Passive Thermal Inc) they make energy comparisons against legacy 6⁰/12⁰C chilled water without aisle containment or free-cooling.
The limitations of the technology include:
You have to have an application (such as HPC or CGI) that can take advantage of the higher processor speed with utilisation above 85%. With average processor utilisation in the 10-40% range (including virtual machines) this excludes most ‘normal’ data centre applications.
The server manufacturer has to be asked if they support (i.e. warranty) the addition of the cooling ‘puck’
The cooling circuit has no redundancy - although if the liquid cooling fails the clock-rate can, in theory at least, be throttled back to the normal air-cooled state
You still need an air-cooling system to remove the rest of the heat, typically 60% of the server load, but it should be rated for 100% in case the chip-cooling fails
The heat removed is high-grade at 70-80⁰C and could be exported for re-use. However, there is one reason why heat-harvesting has not been developed: so far the applications have been in a few stand-alone machines in one location dedicated to HPC/CGI. In larger data centres, where there may be hundreds or thousands of processors, the complexity and cooling system component count will probably make the solution impractical. The power needed to pump the interface circuit for re-using the waste heat will absorb more than the fan power saved and, should the heat load not be available, a 100% rated heat-rejection coil will be required.
Caution: the pump power for taking the heat away from the primary heat-exchanger and the fan power for the heat rejection to ambient are rarely included in the energy claims for liquid cooling.
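As a rough illustration of the heat split, the sketch below uses the indicative figures quoted above – a 300W board, roughly 50W of extra heat from the higher clock-rate and about 40% of the heat captured at the chip (i.e. 60% still air-cooled). All of the values are assumptions for illustration rather than measurements of any particular product.

```python
# Indicative heat split for an on-chip cooled server (illustrative figures only).

SERVER_W = 300              # nominal air-cooled board power
OVERCLOCK_EXTRA_W = 50      # extra heat from the raised clock-rate
CHIP_FRACTION = 0.40        # share of heat captured at the chip surface (~40%)

total_w = SERVER_W + OVERCLOCK_EXTRA_W
chip_loop_w = total_w * CHIP_FRACTION    # high-grade heat into the liquid loop
air_cooled_w = total_w - chip_loop_w     # PSUs, memory, drives: still air-cooled

print(f"Heat into liquid loop : {chip_loop_w:5.0f} W")
print(f"Heat still needing air: {air_cooled_w:5.0f} W")
# The air system is conservatively rated for the full server heat in case the
# chip loop fails and the clock cannot be throttled back in time.
print(f"Air system rating     : {total_w:5.0f} W")
```

Even in this best case the air-cooling plant, the liquid pump and the heat-rejection fans all remain in the power budget.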
Immersion cooling
The established product using this technology is from Green Revolution Cooling (Austin TX, USA). Go see www.grcooling.com. It has a low installed base – usually ‘demonstration’ sites of one ‘cabinet’ – including one in an HPC application in Switzerland (Swiss National Supercomputing, a 50kW demonstration installed in December 2010, although the results are not known to the author).
The server ‘cabinet’ is arranged as a (42U) horizontally mounted waist-high tank which contains a non-conductive mineral oil very similar to a non-perfumed ‘baby-oil’. Each cabinet contains about 500 litres of oil and the servers are suspended from the horizontal 19” rack, fully immersed in the oil. Each tank can accommodate up to 100kW of heat generation but that power from 19” servers is hard to envisage in just 42U. Despite the 100kW capacity and 70⁰-80⁰C take-off temperature the oil hardly presents any fire risk as the ignition temperature is 700⁰C but there is no doubt that the environment is not one of ‘cleanliness’ normally associated with data centre ‘white space’.
The oil is circulated by pumps and the heated oil is fed through a heat exchanger transferring the heat into a chilled water circuit. Dual-circuit oil flow/return is possible (although not done as far as I have seen) with double pipe fittings, 2N paths into the tank and 2N dual-pumps etc. However a tank leak stops all computing and the tank area has to be bunded. The system is not concurrently maintainable by the definitions of the ‘Tier Classifications’ in their various guises.
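For a sense of scale, the sketch below estimates the oil circulation needed to remove the full 100kW from one 500-litre tank. The fluid properties (a specific heat of roughly 1.9 kJ/kg·K and a density of roughly 850 kg/m³) and the 10K oil temperature rise are generic mineral-oil assumptions, not vendor data.

```python
# Rough sizing of the oil circulation needed to remove 100 kW from one tank.
# Fluid properties and temperature rise are generic assumptions, not vendor data.

HEAT_LOAD_KW = 100.0        # full tank capacity quoted above
DELTA_T_K = 10.0            # assumed oil temperature rise across the servers
CP_KJ_PER_KG_K = 1.9        # assumed specific heat of mineral oil
DENSITY_KG_PER_M3 = 850.0   # assumed oil density
TANK_VOLUME_L = 500.0       # oil volume per tank quoted above

mass_flow = HEAT_LOAD_KW / (CP_KJ_PER_KG_K * DELTA_T_K)     # kg/s
volume_flow_l_s = mass_flow / DENSITY_KG_PER_M3 * 1000      # litres per second
turnover_s = TANK_VOLUME_L / volume_flow_l_s                # seconds per tank volume

print(f"Oil mass flow   : {mass_flow:.1f} kg/s")
print(f"Oil volume flow : {volume_flow_l_s:.1f} l/s ({volume_flow_l_s * 3.6:.0f} m3/h)")
print(f"Tank turnover   : once every {turnover_s:.0f} s at full load")
```

Even with this favourable temperature rise the duty is over 20m³/h per tank, which reinforces the earlier caution that the pump and heat-rejection power must appear in any energy claim.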
The servers cannot be used without modification. The on-board server fans have to be removed (albeit with an advantageous slight drop in power draw) but spinning hard-disks cannot be used due to their vent hole.
Advantages are claimed for the technology:
Lower server power due to no on-board fans
The ability to increase the clock-rate and processing capacity
The ability to re-use the waste heat at a high-grade 70⁰C
Lower corrosion risk than contaminated air or fresh-air and, as a result, higher integrity solder joints
The disadvantages/limitations include:
Changing hardware is a messy business and it may slow down the refresh rate, which is the greatest opportunity for power reduction
You have to have a heat load 24/7/365 that can absorb all of the waste heat to make the energy case viable and you have to have a 100% rated heat rejection system in case the ‘heat load’ reduces or fails
You have to have an application (such as HPC or CGI) that can take advantage of the higher processor speed, if that is planned
The server manufacturer has to support (i.e. warranty) the immersion in oil and this may limit the choice of hardware
The cooling circuit can have redundancy but this doubles the cost and space for the pumps and heat exchangers
You still need an air-cooling system to remove the heat from ICT hardware that is not suitable for immersion and to remove the direct heat from the tanks
Encapsulation cooling
The longest established product using this technology is made in the UK by Iceotope. Go see www.iceotope.com. They have a new funder in Schneider Electric and, to date, there is a relatively small installed base – usually ‘demonstration’ sites of one ‘cabinet’, such as in universities, that were ‘donated’ for testing and field-trialling.
The server ‘cabinet’ is bespoke to Iceotope and has a backplane of non-drip fluid connections. Each server (although they refer to them as ‘blades’) is encapsulated in a plastic shell that has a heat exchanger plate on one side. The server is sealed into the plate heat exchanger and immersed in an engineered fluid (Novec by 3M).
That fluid uses convection to transfer the heat from the immersed motherboard to the heat exchanger. The encapsulated server then ‘docks’ into the waiting flow/return non-drip fittings in the backplane of the cabinet. Each cabinet may have 50 sets of docking pipes. Another fluid (originally water, but it appears that users didn’t like having water pipes and connectors inside the cabinet) is then pumped through the server back-plate and carries the heat to a heat-exchanger serving multiple cabinets. That centralised heat-exchanger interfaces with the ‘outside’ – either rejecting via dry-coolers or into the building’s hot-water system so as to re-use the heat. Each cabinet accommodates less absorbed power than the equivalent air-cooled version as the encapsulation takes up width and height. The 3M Novec fluid and the secondary circuit fluid are non-conducting and present no fire risk; in fact, Novec is used as a fire-suppressant. Dual-circuit flow/return is not possible and this is the biggest drawback. Redundant pumps are used but the single connection to each server is the weakest point. The encapsulation and the secondary circuit limit the spillage to minor volumes in the event of a leakage from a fitting, pipe or docking valve.
The servers cannot be used without modification. The fans have to be removed (albeit with an advantageous drop in power) and spinning hard-disks cannot be used due to their vent hole. Until recently only servers could be encapsulated and there was no solution for storage or I/O but the injection of venture capital was intended to extend the technology and may have done so by now. It is marketed solely as an HPC application, especially for increased processing clock-rates, although, surprisingly, the capability to re-use the waste heat is not heavily stressed.
Advantages are claimed for the technology:
Higher density per cabinet
Lower server power due to no on-board fans
The ability to increase the clock-rate and processing capacity
The ability to re-use the waste heat at a high-grade 70⁰C
Lower corrosion risk than contaminated air or fresh-air and higher integrity solder joints
The disadvantages/limitations include:
Fluid in the racks, especially with so many non-drip valves, is not popular in the industry, even if it is not water
There is only a single cooling pathway to each server – Tier II at best
There is no choice of IT cabinet (it has to come from Iceotope, fully fitted with the docking plane and plumbed) and a slightly limited choice of server OEMs that support the encapsulation without invalidating their warranty
Higher density per cabinet is very questionable as they are air-cooled servers that have been encapsulated, not specially designed for encapsulation
The reduction in fan power of the server is usually overstated at 25% compared to the more usual 6-8%
Changing/repairing/refreshing hardware is a task that has to be carried out by the OEM and each encapsulation product is suitable for only one particular server form-factor/shape. This will tend to encourage a slowing-down of the ICT refresh-rate which, at every 18-24 months, is the greatest opportunity for power reduction. The on-going cost of encapsulation is not transparent
You have to have a heat load 24/7/365 that can absorb all of the waste heat to make the energy case viable and you have to have a 100% rated heat rejection system in case the ‘heat load’ reduces or fails
You have to have an application (such as HPC or CGI) that can take advantage of the higher processor speed, if that is planned
The server manufacturer has to support (i.e. warranty) the encapsulation in Novec and this may limit the choice of hardware
You still need an air-cooling system to remove the heat from ICT hardware that is not suitable for encapsulation (Storage, disks, tapes and I/O) and to remove any direct heat from the cabinets
Conclusions
Liquid cooling (on-chip, immersion and encapsulation) is a relatively mature technology that has found very low take-up in the marketplace and so no real field experience exists. It is more suited to HPC applications than to the main data centre and telecoms market, since the opportunities to increase processing speed for HPC in research and universities are attractive.
There is no energy case for on-chip cooling and the reduced energy case for encapsulation and immersion is generally overstated by comparing the technology to data centre cooling from the 1990s. At most the saving is a few percentage points, provided by the removal of the server fans. Modern cooling can provide a pPUE as low as 1.03-1.04, which is lower than that provided by liquid cooling once the heat rejection circuit is included – if that is the only required operational goal.
Re-use of waste heat is a secondary issue and limited by the fact that the heat load must not only exist 24/7/365 but be able to absorb partial output from the HPC installation when the utilisation is low. In every case a fully rated heat rejection system must be installed. If the waste heat is not used the energy case breaks down. If the waste heat is only partially used the RoI on the additional plant required is low.
Redundancy for encapsulation and immersion is difficult to achieve above a classification of Tier II and concurrent maintainability is not possible (which, admittedly, matches the requirements of many HPC workloads). For encapsulation the complexity and high component count make the potential reliability lower than that of a conventional chilled water system.
Technical service and support (although limited in scope to micro-controllers, fittings/valves, pipes, pumps and heat-exchangers) is not widely available from the OEMs and on-site staff will have to be trained and have access to on-site spare parts.
Liquid cooling and the Smart-City/Smart-Grid
There would appear to be a good case for smaller embedded data centres that can export their heat directly into the shared facilities that they occupy. In the 100-200kW range there are many buildings that can absorb the heat 24/7/365. The PUE for these smaller facilities will probably not be optimised, as they will have to include the energy-transfer load, but, in contrast, large PUE-optimised out-of-town/remote locations are unlikely to find a heat load within reach. So a compromise is envisaged where a little PUE could be sacrificed to achieve a very high re-use of waste heat – probably the Holy Grail of data centre energy effectiveness.
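One way to express that compromise is The Green Grid’s Energy Reuse Effectiveness metric, ERE = (total facility energy – reused energy) / IT energy. The sketch below compares an assumed embedded site, whose PUE is penalised by the heat-export plant but which passes on most of its total energy as useful heat, with an assumed remote site that has an excellent PUE but no heat customer; both sets of figures are purely illustrative.

```python
# Energy Reuse Effectiveness (ERE) comparison with illustrative figures.
# ERE = (total facility energy - reused energy) / IT energy = PUE * (1 - ERF),
# where ERF is the reused energy as a fraction of total facility energy.

def ere(pue, reuse_fraction_of_total):
    """Return ERE for a given PUE and energy reuse factor (ERF)."""
    return pue * (1 - reuse_fraction_of_total)

# Assumed embedded 100-200kW facility: PUE penalised by the heat-export plant,
# but a large share of the total energy leaves as useful heat to the host building.
embedded = ere(pue=1.25, reuse_fraction_of_total=0.60)

# Assumed remote, PUE-optimised site with no practical heat load within reach.
remote = ere(pue=1.10, reuse_fraction_of_total=0.00)

print(f"Embedded site ERE: {embedded:.2f}")   # ~0.50
print(f"Remote site ERE  : {remote:.2f}")     # 1.10
```

On these assumed figures the embedded site achieves an ERE of about 0.5 against 1.1 for the remote site, despite the poorer PUE – exactly the trade-off described above.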
Prof Ian F Bitterlin
CEng PhD BSc(Hons) BA DipDesInn
FIET MCIBSE MBCS
Consulting Engineer & Visiting Professor, University of Leeds
Critical Facilities Consulting in partnership with DLE Consulting
email: bitterlin@criticalfacilities.co.uk