Forget PUE and Dramatically Reduce Energy Consumption?
The term data centre ‘efficiency’ is a misuse of the basic physical definition of ‘energy OUT divided by energy IN’. With very few exceptions, probably fewer than 1% and in many countries far fewer, data centres are better described as ‘zero efficient’: we know what goes IN (mainly kWh of electricity, plus a little diesel fuel and the energy embedded in utility water) and there is no energy coming OUT other than waste heat. They are very good fan-heaters, unfortunately sited outdoors. The term ‘effectiveness’ is more relevant, since it relates the consumption of resources to a notion of the output (ICT services) being somehow valuable and sacrosanct - which is why we have PUE, WUE and CUE et al., where the E stands for Effectiveness, not Efficiency. However, we should not then immediately ignore the 1 in the ‘1 point something’: a PUE of 1.3 is regarded as a measure of goodness while ignoring that the 1 (the ICT load) might make the facility a considerable waste of energy.
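To put numbers on that point, here is a minimal sketch (in Python, with purely hypothetical figures, not data from any real facility) of the basic PUE arithmetic and of why a ‘good’ ratio can hide a wasteful ICT load:

```python
# Minimal sketch of the PUE arithmetic (hypothetical figures, purely illustrative).
# PUE = total facility energy / ICT equipment energy, so the '1' is the ICT load itself.

def total_energy_kwh(ict_energy_kwh: float, pue: float) -> float:
    """Total facility energy implied by a given ICT load and PUE."""
    return ict_energy_kwh * pue

# Facility A: bloated, idling ICT estate but an 'excellent' PUE.
a_total = total_energy_kwh(ict_energy_kwh=1_000_000, pue=1.2)   # 1,200,000 kWh

# Facility B: right-sized ICT doing the same useful work, mediocre PUE.
b_total = total_energy_kwh(ict_energy_kwh=400_000, pue=1.5)     # 600,000 kWh

print(a_total, b_total)   # the facility with the 'worse' PUE uses half the energy
```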
In fact, if you are cynical by nature, it is not much of a stretch to describe the original innovation of PUE as an initiative to take the focus away from the energy effectiveness of the ICT hardware at the time and put it onto the M&E infrastructure.
To be clear, I am not saying that PUE is not extremely useful, especially if it is related to the users’ business appetite for risk, rather that it should be based on the lowest ICT load needed to provide the IT service for the application. The three steps of classical sustainability are, in strict order: reduce, optimise and then (and only then) use renewable energy. PUE belongs to the second step, optimisation, yet far too much effort is currently directed at the renewable-power issue, which results in the waste of a valuable resource.
If we ignore HPC, many mainframe loads, cloud, hyperscale search and social networking sites, the typical enterprise and colocation data centre (which perhaps accounts for 80% of the global data centre estate) is characterised by low utilisation, often with idling power consumption at far too high a percentage of full ICT load power. In other words, a high proportion of data centres spend most of the time idling and, when they do, burn a disproportionate amount of power.
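The effect is easy to illustrate with a simple linear power model, an assumption for illustration only rather than a measured server curve:

```python
# Why high idle power dominates at low utilisation. Assumes a simple linear
# power model between idle and peak - an approximation, not a measured curve.

def server_power_w(utilisation: float, peak_w: float, idle_fraction: float) -> float:
    """Power draw at a given utilisation (0..1), where idle power is
    idle_fraction * peak and draw rises linearly to peak at 100% load."""
    idle_w = peak_w * idle_fraction
    return idle_w + (peak_w - idle_w) * utilisation

# A server that idles at 60% of its 400 W peak, running at 10% utilisation:
p = server_power_w(utilisation=0.10, peak_w=400.0, idle_fraction=0.60)
print(p, p / 400.0)   # 256.0 W - 64% of peak power for 10% of the work
```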
Whilst many best-practice design and operational guides include some nods to the load, such as the Green Grid’s DCMM and the EU CoC, the amount of effort actually directed at it is rarely, if ever, publicly aired. We have, rightly, concentrated on thermal management: raising temperatures to maximise free cooling and addressing air-flow with blanking plates and hole-stopping – all with a huge impact on PUE, and with improvements that are easily monitored and reported.
So how can we focus on the 1 in the ‘1 point something’? One way involves the power and IT performance data listed on the SPEC website.
SPECpower_ssj2008, from SPEC (the Standard Performance Evaluation Corporation, go see www.spec.org/power_ssj2008), is a benchmark suite measuring the power and performance characteristics of server-class computer equipment. It is used to compare power and performance among different servers and serves as a toolset for improving server efficiency. The benchmark, now over ten years old, is targeted at hardware vendors, the IT industry, computer manufacturers and governments, and was the first industry-standard benchmark to evaluate the power and performance characteristics of servers. The drive to create it came from the recognition that those same groups are increasingly concerned with the energy use of servers.
The workload exercises the CPUs, caches, memory hierarchy and the scalability of shared memory processors (SMPs) as well as the implementations of the JVM (Java Virtual Machine), JIT (Just-In-Time) compiler, garbage collection, threads and some aspects of the operating system.
Some people think that the SPECpower test routine doesn’t reflect real loads, and that is not unreasonable, but all the server OEMs submit their performance data by model and processor type, and the website is publicly open and free to access for anyone interested – so why not use it to compare servers as a first step in server evaluation? If the server OEMs don’t want to be benchmarked, then they have the choice not to list their products.
There are other benchmarking systems, e.g. SERT, LINPACK etc., but whether or not ICT hardware purchasers use performance as an energy criterion is an open question. One element of the benchmark test is to measure idle power, a concept which contributes to the ISO 30134 metric ITEE. However, I have been criticised before (by sources that have somewhat of a vested interest) for suggesting that idle power should be a specification consideration. I do understand where they are coming from – idling should be avoided as a waste of resources – but the fact is that far too many servers idle in the real world. Some sources, including Intel not so long ago, have suggested that utilisation is commonly 10%, so idle power is clearly important and a realistic problem. Maybe this argument would be resolved if all our ICT needs were migrated to heavily utilised ‘cloud’ data centres whose utilisation exceeds 60%, but that is likely to be some way off, and uptime, latency, security and national boundaries are still big issues to be taken into account.
In 2014 a White Paper from a very respected source predicted that the lower limit for idle power using silicon-based microprocessors was going to be 23% but, by 2017, that had been superseded by one ‘best-in-class’ model that idled at 13%. However, that belies the fact that the highest idle power still listed is 79% - a considerable number when contrasted with the effort spent reducing a PUE from 1.6 to 1.5. But things are going in the right direction: between 2014 and 2017 the average idle power (across several hundred listed servers) declined from 41% to 35%.
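As a back-of-envelope contrast, using the simple linear model sketched earlier and illustrative figures rather than any specific server:

```python
# Back-of-envelope contrast (illustrative numbers only) between an idle-power
# improvement and a PUE improvement from 1.6 to 1.5.

def relative_saving(before: float, after: float) -> float:
    return (before - after) / before

# PUE 1.6 -> 1.5 with an unchanged ICT load saves this share of total energy:
pue_saving = relative_saving(1.6, 1.5)                   # 6.25%

# Idle power falling from 79% to 35% of peak, for a server at 10% utilisation,
# assuming the simple linear power model sketched earlier:
power_at_79 = 0.79 + (1 - 0.79) * 0.10                   # ~0.81 of peak
power_at_35 = 0.35 + (1 - 0.35) * 0.10                   # ~0.42 of peak
idle_saving = relative_saving(power_at_79, power_at_35)  # ~49% of the ICT energy

print(f"PUE improvement saves:  {pue_saving:.1%}")
print(f"Idle improvement saves: {idle_saving:.1%}")
```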
Other metrics listed for each server are ‘Peak Watts’ at full speed (very interesting when compared with name-plate data), ‘Operations per Watt’, the concept of which contributes to the new ISO 30134 metric ITEU, and ‘Operations’ at full speed. Put together, these metrics describe everything you need to know to compare one server with another, except U-height and cost, always assuming that you appreciate that the SPEC test may not precisely represent your real load.
Do you remember the time when your data centre power draw was nearly constant 24/7 even though your IT load wasn’t? Or does that still describe your experience? Could that be due to idling power being very high in relation to your peak? Or what about the common experience of server name-plate power data being more than double the reality when in service? That, too, could be explained by idling.
So, let’s compare two examples from the 2018 SPEC listings. I won’t name the OEMs (but you can look for yourself) because it is the concept that matters more, and servers are developing rapidly.
As they say, comparisons are odious: Model B consumes 17% less power at 100% load, idles at 13% of peak power instead of 57%, operates 150x faster and delivers 270x more Operations/Watt. So, if you utilise it with virtualisation, Model B could replace hundreds of Model A for less than 1% of the power consumed, i.e. a 99% reduction. Alternatively, if you indulged in bad practice, you could idle Model B at 20% of the power consumption of Model A. Any CapEx difference (if there is one, which I doubt) would be swamped by the huge savings in the cost of energy and in data centre construction.
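A rough sketch of that consolidation arithmetic, using only the ratios quoted above; the absolute wattage and fleet size are assumptions chosen purely for illustration:

```python
# Rough consolidation sketch using only the ratios quoted above; the absolute
# wattage and fleet size are assumptions, not figures from the SPEC listings.

model_a_peak_w = 400.0                           # assumed Model A peak power
model_b_peak_w = model_a_peak_w * (1 - 0.17)     # '17% less power at 100% load'

speed_ratio = 150                                # Model B is ~150x faster at full speed

fleet_a = 300                                    # fully loaded Model A servers
fleet_b = fleet_a / speed_ratio                  # 2 Model B servers do the same work

power_a = fleet_a * model_a_peak_w               # 120,000 W
power_b = fleet_b * model_b_peak_w               # ~664 W

print(f"Replacement draws {power_b / power_a:.2%} of the original power")   # <1%
```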
Should we care about PUE? Yes, absolutely, but only after we take the first step of reducing the load…