Building a Low Power Consumption Server - Part II

TL;DR: In the first part of this blog series, we talked about the hardware configuration. In this second part, we will focus on the software/BIOS improvements I made to reach a system that consumes 15.5 watts (and, with all optimizations, 14.5 watts) at the wall in idle. A short recap of the hardware: 128 GB DDR4 (non-ECC) memory, 6 cores and 12 threads, 2x SSDs/1x NVMe/7x fans, a 10 Gigabit network + IPMI, all in a 1U server case.

Low power: why does it matter?

Something that was not really covered in the first post of this series is the question: why should we put effort into building a system with low power consumption? First, the obvious answer: money. Since electricity prices here in Germany are not among the cheapest [1], reducing power draw is one way to save on the electricity bill. Furthermore, less power consumption directly means less heat, which results in quieter fans, lower fan RPMs and simpler airflow requirements (low power consumption also reduces the CO₂ footprint), especially in small home labs or rack setups in living spaces. Besides reducing temperatures and noise, lower power also reduces electrical stress in general, which can lead to a longer lifespan for fans and power supplies.

Where should we start?

Before we jump to the operating system layer, we will start with the Basic Input/Output System (BIOS) settings. As the BIOS is the interface between the software and hardware layers, it is the first point where we can influence whether a system is able to go down into a sleep state or not. The main settings relevant for this purpose are related to ASPM and C-States.

Let's talk about C-States

There are several ways to control power usage at the CPU level, for example Performance States (P-States), Throttling States (T-States), and Processor Operating States, aka C-States. The latter are the states we are interested in, as they reflect the capability to turn off unused components while the CPU is idling. The goal is to save power. Since servers often spend much of their time idle, waiting for a new task, reaching a high C-State level (C0-C10) is important.

| C-State   | Core Clocks | Core Voltage | L1/L2 Cache | Architectural State   | Wake Latency | Power Impact |
|-----------|-------------|--------------|-------------|-----------------------|--------------|--------------|
| C0        | On          | Nominal      | Active      | Active                | ~0 ns        | None         |
| C1        | Gated       | Nominal      | Retained    | Retained              | Very low     | Very low     |
| C1E       | Gated       | Reduced      | Retained    | Retained              | Low          | Low          |
| C3        | Off         | Reduced      | Flushed     | Retained              | µs range     | Medium       |
| C6        | Off         | Collapsed    | Flushed     | Saved to LLC          | Higher µs    | High         |
| C7        | Off         | Collapsed    | Flushed     | Saved + deeper gating | Higher       | Very high    |
| C8/C9/C10 | Off         | Collapsed    | Flushed     | Minimal retention     | Highest      | Maximum      |

A higher level allows the CPU to go into a deeper sleep mode, which means the deeper the state, the less power the CPU consumes. Easy, right [2]? Of course, going into a deeper sleep mode also means that the wake-up time increases, since with each level more circuits and signals are turned off. In the past [3] this could lead to unexpected freezes and reboots, but it should not be a problem nowadays. Now that we have talked about C-States, let's tweak the BIOS!
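
On Linux, the C-states that the idle driver actually exposes can be inspected through the cpuidle sysfs interface. Below is a minimal Python sketch; the path layout is the usual one for the intel_idle driver, but treat the exact state names and count as system-specific:

```python
from pathlib import Path

def read_cpuidle_states(base="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, wake-up latency in µs) for every exposed idle state."""
    states = []
    # sort state directories numerically (state0, state1, ..., state10)
    for state_dir in sorted(Path(base).glob("state*"),
                            key=lambda p: int(p.name[len("state"):])):
        name = (state_dir / "name").read_text().strip()
        latency = int((state_dir / "latency").read_text().strip())
        states.append((name, latency))
    return states

if __name__ == "__main__":
    # On an Intel system this typically lists POLL, C1, C1E, ... up to C10
    for name, latency in read_cpuidle_states():
        print(f"{name}: wake-up latency {latency} µs")
```

The latency column is exactly the trade-off discussed above: the deeper the state, the larger the reported wake-up latency.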

Say the line, Bart! What the hell is ASPM?

Before we dive into BIOS tweaks, we need to talk about Active State Power Management (ASPM). Both ASPM and C-States are necessary for decreasing power consumption (in other words: if you ignore ASPM, you might never reach deep idle states no matter how well you tuned your C-States). Unlike C-States, which reduce CPU core/package power, ASPM controls the power consumption of PCIe devices [4]. The main purpose of ASPM is to reduce the power consumption of the PCIe link between the root complex (or switch) and the connected endpoint device.

Well, why is ASPM important? Even if the CPU can reach a deep C-State, the system cannot reach a very low idle power as long as PCIe links stay fully active. This is where ASPM comes into play. Similar to C-States, we have different levels (L0-L1.2) for different purposes [5]. A short summary can be found below:

| State | Description    | Latency  | Power Saving |
|-------|----------------|----------|--------------|
| L0    | Fully active   | None     | None         |
| L0s   | Standby-like   | Very low | Small        |
| L1    | Deeper idle    | Moderate | Significant  |
| L1.1  | Substate of L1 | Higher   | More         |
| L1.2  | Deep substate  | Highest  | Maximum      |
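
Before touching any BIOS switches, it is worth checking which ASPM policy the Linux kernel currently applies. It is exposed via the pcie_aspm module parameter, with the active policy marked in brackets. A small sketch:

```python
from pathlib import Path

POLICY_FILE = Path("/sys/module/pcie_aspm/parameters/policy")

def active_aspm_policy(raw: str) -> str:
    """Extract the active policy from a string like
    'default performance [powersave] powersupersave'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active policy marked")

if __name__ == "__main__":
    if POLICY_FILE.exists():
        print(active_aspm_policy(POLICY_FILE.read_text()))
```

A policy of powersave or powersupersave is what we are after here; default leaves the decision to the firmware.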

The best case would be L1.2; but even though I put a lot of effort into finding hardware that supports ASPM L1.2, most of the PCIe devices on my motherboard (Supermicro X12SCZ-F) only support L1, as seen below:

➜  ~ lspci -vv | awk '/ASPM/{print $0}' RS= | grep -P '(^[a-z0-9:.]+|ASPM )'
00:1b.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #17 (rev f0) (prog-if 00 [Normal decode])
                LnkCap: Port #17, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
00:1c.0 PCI bridge: Intel Corporation Comet Lake PCIe Root Port #1 (rev f0) (prog-if 00 [Normal decode])
                LnkCap: Port #1, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
00:1c.5 PCI bridge: Intel Corporation Comet Lake PCIe Port #6 (rev f0) (prog-if 00 [Normal decode])
                LnkCap: Port #6, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
00:1c.6 PCI bridge: Intel Corporation Comet Lake PCIe Root Port #7 (rev f0) (prog-if 00 [Normal decode])
                LnkCap: Port #7, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 (prog-if 02 [NVM Express])
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
04:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04) (prog-if 00 [Normal decode])
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <2us
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
➜  ~

Worth noting: only the Intel X710-DA2 (02:00.0 and 02:00.1) and the Samsung PM981a NVMe (01:00.0) are add-in devices; the remaining listed devices are onboard. As described in the first part, in this part we will talk about the changes I made to achieve 15.5 watts in idle (in the end, with all settings and optimizations, it was even 14.5 watts). A short recap: C-States and ASPM work hand in hand; as long as a PCIe link stays fully active, the CPU can never go into a deeper state because the active link keeps interrupting it. As also mentioned in my last post, it matters how the PCIe slot is connected from the hardware perspective. In general, there are two types:

A) the PCIe bus is directly connected to the CPU, which usually gives us a shorter physical and logical path to the CPU. This results in better performance because there is no bottleneck (the chipset) in between; and

B) the PCIe bus is connected to a chipset, the Platform Controller Hub (PCH); in the case of the Supermicro X12SCZ-F, an Intel W480E.

The chipset acts as a "hub" between the PCIe slot and the CPU. Since CPUs only have a limited number of lanes to talk to the hardware, the PCH takes some of these lanes and splits them into multiple slower connections.

The Supermicro X12SCZ-F motherboard has three PCIe slots [6]: PCIe x4 (SLOT4), PCIe x16 (SLOT6), and PCIe x4 (SLOT7). As seen in the figure above, SLOT4 is connected to the PCH and SLOT6 is directly connected to the CPU. During the tests and optimizations of the system I figured out that my Intel X710-DA2 network card in SLOT6 (the one that is connected to the CPU) prevented the CPU from going deeper than C8. After moving the card to SLOT4, the CPU was able to reach C10 again. Although the slot has a lower speed specification (x4) than the Intel X710-DA2's x8, this had no impact on the performance of the network card. Moving the card to SLOT4 also had the advantage that neither the NVMe nor the network card is covered, which has a positive effect on the cooling of both parts.

BIOS tweaks

What we want is to allow the CPU to go into a deep sleep state, in other words: C8 or even C10. For this purpose, we have to enable ASPM on the software side for every device that supports it, and we have to enable the capability within the BIOS; the same applies to the C-States related settings. The first thing I did was update my BIOS; my board came with a very old version, so I decided to update it (this step was quite easy: download the new BIOS [7], put the binary file on a USB stick, boot into a UEFI shell and follow the instructions in the README file). I also updated the Baseboard Management Controller (BMC) after the BIOS update.

After doing this on all three of my nodes, I started to tweak the BIOS settings. First of all, I enabled all C-States features within the CPU Configuration as seen below:

Furthermore, I decided to change the PL1 (Long Duration Power Limit) to 35W and the PL2 (Short Duration Power Limit) to 45W. This results in a Thermal Design Power (TDP) of 35W, which is the same as the Intel i5-10400T [8] has. Since the system sits in a 1U server case, the advantages are less heat and fan noise while keeping the CPU in a more efficient performance-per-watt range. This, in combination with my fan settings and the custom air shroud, results in a maximum of 55-56°C under load.
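
If you want to cross-check the effective PL1/PL2 values from the running system, the kernel exposes them (in microwatts) through the intel-rapl powercap interface. A small sketch; the powercap path and the constraint numbering are the common defaults and may differ per platform:

```python
from pathlib import Path

# Package 0; some platforms use subdomains like intel-rapl:0:0
RAPL = Path("/sys/class/powercap/intel-rapl:0")

def microwatts_to_watts(uw: int) -> float:
    """powercap reports power limits in microwatts."""
    return uw / 1_000_000

if __name__ == "__main__":
    # constraint_0 is usually the long-term limit (PL1),
    # constraint_1 the short-term limit (PL2)
    for idx, label in ((0, "PL1"), (1, "PL2")):
        limit_file = RAPL / f"constraint_{idx}_power_limit_uw"
        if limit_file.exists():
            uw = int(limit_file.read_text())
            print(f"{label}: {microwatts_to_watts(uw):.1f} W")
```

With the BIOS settings above this should report roughly 35 W and 45 W.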

In addition to the CPU settings, I enabled ASPM on every PCIe device and disabled the PCIe x16 slot (SLOT6), as this slot is directly connected to the CPU (see above). In the ACPI Settings I also enabled Native PCIE and Native ASPM support. These settings allow ASPM to be controlled by the operating system rather than the BIOS.

Hidden BIOS Settings?

For unknown reasons, some features cannot be disabled through the BIOS settings. For example, the board comes with an audio chipset, which is of course useless in a home server context, and with two LAN ports (Intel I210 and Intel I219-LM) (the I219-LM chipset is driven by the e1000e kernel driver, which stopped my setup from going deeper than C8). As my setup includes an Intel X710-DA2, I only needed the Intel I210 device for the stage 1 phase where I decrypt my root partition. I could also use the 10 gig network card for this purpose, but I want a separate physical network for reasons. The Intel I210 device can also be used to access the system via IPMI. Luckily, this chipset is driven by the Linux kernel driver igb, which had (at least in my setup) good performance as well as power-efficient behavior that does not hinder the CPU from reaching C10.

It is very likely that these features can be disabled within the BIOS, but the needed menu entries are not visible. At this point we have to go deeper.

To see the setup structure within the BIOS we have to extract the UEFI Internal Form Representation (IFR) binary data from the UEFI firmware. The firmware consists of different modules, one of them being the Setup module (as the motherboard is an AMI (Aptio) based one, we can use the GUID 899407D7-99FE-43D8-9A21-79EC328CAC21 to find the Setup module). This module contains the IFR and lists all questions/options along with their VarStore name, offset and allowed values. Even if some settings are hidden or grayed out in the visible BIOS UI, with the help of the IFR we can still see those entries.

UEFI-Editor [9] is one tool that gives us the opportunity to load our BIOS binary blob. After loading the BIOS bin file we can easily extract the Setup module. This approach is explained in the README of the UEFI-Editor project [10]. IFRExtractor RS [11] then allows us to extract the IFR in a human-friendly format:

➜  ~ nix-shell -p ifrextractor-rs nodejs
[...]
[nix-shell:~]# ifrextractor Section_PE32_image_Setup_Setup.sct verbose
Extracting all UEFI HII form packages using en-US UEFI HII string packages in verbose mode
[nix-shell:~]# node IFR-Formatter.js Section_PE32_image_Setup_Setup.sct.0.0.en-US.ifr.txt
[nix-shell:~]# cat Section_PE32_image_Setup_Setup.sct.0.0.en-US.ifr.txt
[...]

HD Audio | VarStore: PchSetup | VarOffset: 0x557 | Size: 0x1
    Disabled: 0x0
    Enabled: 0x1

PCH LAN Controller | VarStore: PchSetup | VarOffset: 0x9 | Size: 0x1
    Enabled: 0x1
    Disabled: 0x0

PCI Express Root Port 21 | VarStore: PchSetup | VarOffset: 0x10A | Size: 0x1
    Disabled: 0x0
    Enabled: 0x1

AMT BIOS Features | VarStore: MeSetup | VarOffset: 0x17 | Size: 0x1
    Disabled: 0x0
    Enabled: 0x1

ACPI D3Cold Support | VarStore: Setup | VarOffset: 0x442 | Size: 0x1
    Disabled: 0x0
    Enabled: 0x1

[...]
➜  ~

As assumed, the IFR contains the data we are looking for. The listing above shows the settings that I enabled/disabled with the help of the setup_var.efi tool [12]. This can be done by putting the setup_var.efi binary on a FAT32 formatted USB drive, booting the system into a UEFI shell and writing the new values:

FS0:\> setup_var.efi PchSetup:0x557=0x00
FS0:\> setup_var.efi PchSetup:0x09=0x00
FS0:\> setup_var.efi PchSetup:0x10A=0x00
FS0:\> setup_var.efi MeSetup:0x17=0x00
FS0:\> setup_var.efi Setup(0x01):0x442=0x01

These changes are permanent, so keep in mind that if you disable necessary features, the system may not boot anymore and you will have to reset your CMOS. In my case, the changes worked as expected. The changes above disable the audio device, the Intel I219-LM, the PCIe x4 slot (SLOT7) and Intel Active Management Technology (AMT). They also enable ACPI D3Cold support, which allows the OS to power down devices when they are idle to reduce standby/idle power. After a reboot, the audio device, the network card as well as the serial controller that comes with the Intel AMT feature are gone:

 00:14.0 USB controller: Intel Corporation Comet Lake USB 3.1 xHCI Host Controller
 00:14.2 RAM memory: Intel Corporation Comet Lake PCH Shared SRAM
 00:16.0 Communication controller: Intel Corporation Comet Lake HECI Controller
-00:16.3 Serial controller: Intel Corporation Comet Lake Keyboard and Text (KT) Redirection
 00:17.0 SATA controller: Intel Corporation Comet Lake SATA AHCI Controller
 00:1b.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #17 (rev f0)
 00:1c.0 PCI bridge: Intel Corporation Comet Lake PCIe Root Port #1 (rev f0)
 00:1c.5 PCI bridge: Intel Corporation Comet Lake PCIe Port #6 (rev f0)
 00:1c.6 PCI bridge: Intel Corporation Comet Lake PCIe Root Port #7 (rev f0)
 00:1f.0 ISA bridge: Intel Corporation W480 Chipset LPC/eSPI Controller
-00:1f.3 Audio device: Intel Corporation Comet Lake PCH cAVS
 00:1f.4 SMBus: Intel Corporation Comet Lake PCH SMBus Controller
 00:1f.5 Serial bus controller: Intel Corporation Comet Lake PCH SPI Controller
-00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (11) I219-LM
 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
 02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
 02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)

Now that we have spoken about C-States, ASPM and the BIOS settings, we can jump to the software side, where we can optimize things as well. If you are interested in the full list of the formatted IFR data, you can find the output for this specific motherboard here.

BIOS tweaks done: now comes the software

After spending some time on BIOS tweaks as well as the theory, we will jump straight to the software part. For this project I decided to use NixOS [13] as my main OS. NixOS is a Linux distribution built on the Nix package manager. Unlike other Linux distributions, you don't install your software manually; instead, you describe your system state within a config file. In short: "one config to rule them all" (you can also split this into multiple files for better readability and organization, which many people do). The advantages of NixOS are reliability through atomic updates and declarative cleanliness. In this context this means: every system change creates its own generation, which allows jumping back to the previous one if the current one is broken. Furthermore, it means that only what is explicitly defined in your config file exists on the system.

The approach of atomic updates and declarative cleanliness lets me define my entire cluster in code, and I can deploy and provision the whole setup in minutes. I am going to talk about this topic more in depth in part three of this series, where we will look at the power consumption once the cluster reaches its final stage.

Activating ASPM for all supported devices

Since we have enabled ASPM support for the PCIe slots within the BIOS, we now have to enable ASPM for each device on the software side. For this purpose, we can use the AutoASPM [14] Python script by Wolfgang. This script is an advanced version of the Python script by z8 [15]. Using the AutoASPM package also has the advantage (at least for me) that the script, including its service, can easily be added to my Nix configuration:

{ pkgs, ... }:
{
  # [...]
  services = {
    autoaspm.enable = true;
    # [...]
  };
}

However, the Python script can also be used without NixOS without any problems. The script uses lspci to find devices that support ASPM (e.g. L0s, L0sL1 or L1) and activates it automatically for each device. I modified the script a bit to also show the device that is affected by the change. The patch can be found here. An execution can look like this:

➜  ~ python3 ./autoaspm.py
00:1b.0: PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #17 (rev f0) - Enabled ASPM L1
00:1c.0: PCI bridge: Intel Corporation Comet Lake PCIe Root Port #1 (rev f0) - Enabled ASPM L1
00:1c.5: PCI bridge: Intel Corporation Comet Lake PCIe Port #6 (rev f0) - Enabled ASPM L0sL1
00:1c.6: PCI bridge: Intel Corporation Comet Lake PCIe Root Port #7 (rev f0) - Enabled ASPM L0sL1
01:00.0: Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 - Enabled ASPM L1
02:00.0: Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) - Enabled ASPM L1
02:00.1: Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) - Enabled ASPM L1
03:00.0: Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) - Enabled ASPM L0sL1
04:00.0: PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04) - Enabled ASPM L0sL1
➜  ~ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Comet Lake-S 6c Host Bridge/DRAM Controller
           +-08.0  Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
           +-12.0  Intel Corporation Comet Lake PCH Thermal Controller
           +-14.0  Intel Corporation Comet Lake USB 3.1 xHCI Host Controller
           +-14.2  Intel Corporation Comet Lake PCH Shared SRAM
           +-16.0  Intel Corporation Comet Lake HECI Controller
           +-17.0  Intel Corporation Comet Lake SATA AHCI Controller
           +-1b.0-[01]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-1c.0-[02]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           +-1c.5-[03]----00.0  Intel Corporation I210 Gigabit Network Connection
           +-1c.6-[04-05]----00.0-[05]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
           +-1f.0  Intel Corporation W480 Chipset LPC/eSPI Controller
           +-1f.4  Intel Corporation Comet Lake PCH SMBus Controller
           \-1f.5  Intel Corporation Comet Lake PCH SPI Controller
➜  ~

ASPM is now activated for several devices, as seen above. You can also see that the Intel X710-DA2 and the NVMe SSD controller of my Samsung PM981a are set to L1, and that the network card and the NVMe controller sit behind the root ports 00:1b.0 and 00:1c.0. This is the reason why autoaspm also shows these ports as "Enabled".
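
To verify afterwards which links actually ended up with ASPM enabled, the relevant LnkCtl lines can be paired with their device headers by a small parser. A sketch assuming the lspci -vv output format shown above (note that lspci may need root privileges to show LnkCtl at all):

```python
import re
import shutil
import subprocess

DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+(.*)")
LNKCTL_RE = re.compile(r"LnkCtl:\s+ASPM ([^;]+);")

def aspm_status(lspci_output: str) -> dict:
    """Map PCI address -> ASPM setting taken from each device's LnkCtl line."""
    status, current = {}, None
    for line in lspci_output.splitlines():
        if (m := DEVICE_RE.match(line)):
            current = m.group(1)          # new device header
        elif current and (m := LNKCTL_RE.search(line)):
            status[current] = m.group(1).strip()
    return status

if __name__ == "__main__" and shutil.which("lspci"):
    out = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout
    for addr, aspm in sorted(aspm_status(out).items()):
        print(f"{addr}: ASPM {aspm}")
```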

Further improvements

In addition to activating ASPM, we will use powertop [16] and its built-in --auto-tune function to optimize some settings automatically, for example enabling link power management for various devices such as SATA. Using --auto-tune together with the fan settings (more on this later) allows us to reach C10 around 94-97% of the time on average. The result can be seen below:

The settings set by powertop are not permanent, which means you would have to run the command again after every reboot. On NixOS this can easily be enabled at boot by using:

{ pkgs, ... }:
{
  # [...]
  powerManagement = {
    enable = true;
    powertop.enable = true;
  };
  # [...]
}
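
One of the tunables powertop's --auto-tune typically flips is the SATA link power management policy, and whether it survived a reboot can be checked via sysfs. A sketch; the set of policies treated as "low power" here is my assumption based on the values commonly exposed by the kernel's ahci driver:

```python
from pathlib import Path

# Policies considered "low power" (assumption, see lead-in)
LOW_POWER = {"med_power_with_dipm", "medium_power", "min_power"}

def hosts_without_lpm(policies: dict) -> list:
    """Return SATA hosts whose link power policy is not a low-power one."""
    return [host for host, policy in sorted(policies.items())
            if policy not in LOW_POWER]

if __name__ == "__main__":
    policies = {
        p.parent.name: p.read_text().strip()
        for p in Path("/sys/class/scsi_host").glob(
            "host*/link_power_management_policy")
    }
    for host in hosts_without_lpm(policies):
        print(f"{host}: still set to {policies[host]}")
```

No output means every SATA host is already in a low-power link policy.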

By enabling these optimizations, activating ASPM for all supported devices and tuning the fan settings, we reach 14.5 watts at the wall in idle. This is 1 watt better than mentioned in my last post. During the measurement, only a DAC cable was plugged into the Intel X710-DA2.

Turning RPM into Watts: Why Fan Curves Matter

In the previous blog post I showed that my custom build has 7 fans installed: 5x ARCTIC S4028-6K and 2x Noctua NF-A4x20 PWM. When every fan runs at maximum RPM, one Noctua consumes approximately 0.6W and one ARCTIC ~1.2W. Since we have 2x Noctua and 5x ARCTIC, that adds up to ~7-8W. Running the fans at full speed has two drawbacks: the power consumption and the noise generated by the rotation of the fans. Both should be avoided. Here I have had good experiences with the GitHub project smfc [17] by Peter Sulyok. The smfc Python script uses the IPMI tool ipmitool to control the fans depending on the configured thresholds and the load.
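
The worst-case fan power estimate above is simple arithmetic; as a quick sanity check with the per-fan figures from this paragraph:

```python
def total_fan_power(fans):
    """Sum worst-case draw for a list of (count, watts_each) tuples."""
    return sum(count * watts for count, watts in fans)

# 2x Noctua NF-A4x20 (~0.6 W each) and 5x ARCTIC S4028-6K (~1.2 W each)
print(total_fan_power([(2, 0.6), (5, 1.2)]))  # ~7.2 W at full RPM
```

Roughly half of the final 14.5 W idle figure, which is why keeping the fans off full RPM matters so much.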

Once again NixOS helps here: I created my own package that starts smfc during the boot sequence. My settings can be found below:

{ pkgs, ... }:
{
  # [...]
  services = {
    smfc = {
      enable = true;
      settings = {
        Ipmi = {
          fan_mode_delay = 10;
          fan_level_delay = 2;
        };
        "CPU zone" = {
          enabled = 1;
          ipmi_zone = "0 1";
          temp_calc = 1;
          steps = 6;
          sensitivity = "3.0";
          polling = 6;
          min_temp = "35.0";
          max_temp = "65.0";
          min_level = 20;
          max_level = 100;
        };
        "HD zone" = {
          ipmi_zone = 1;
          enabled = 0;
        };
        "GPU zone" = {
          enabled = 0;
          ipmi_zone = 1;
        };
      };
    };
    # [...]
  };
}
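
To get a feeling for what these values mean: smfc maps the zone temperature between min_temp and max_temp onto a fan level between min_level and max_level in a number of discrete steps. The following is a simplified sketch of that mapping using the CPU zone values above (smfc's real algorithm additionally applies sensitivity and polling; this is not the project's actual code):

```python
def fan_level(temp, min_temp=35.0, max_temp=65.0,
              min_level=20, max_level=100, steps=6):
    """Map a zone temperature to a stepped fan level (in percent)."""
    if temp <= min_temp:
        return min_level
    if temp >= max_temp:
        return max_level
    # which of the `steps` intervals the temperature falls into
    step = int((temp - min_temp) / (max_temp - min_temp) * steps)
    return min_level + round((max_level - min_level) * step / steps)

if __name__ == "__main__":
    for t in (30, 45, 55, 65):
        print(f"{t}°C -> fan level {fan_level(t)}%")
```

Below 35°C the fans idle at 20%, and only at 65°C do they ramp to 100%, which is what keeps them in the quiet 840-960 RPM band most of the time.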

These settings and my custom air shroud let the fans spin at around 840-960 RPM about 95% of the time. This results in 36-37°C in idle and 55-56°C under load for each node in my rack, stacked one above the other. And best of all: the system runs cool and consumes less power.

Conclusion

With my final setup (the BIOS settings, hidden and visible, ASPM enabled through the AutoASPM package, the powertop optimizations and, last but not least, the fan settings) the system runs quietly at a power consumption of 14.5 watts in idle. Most of the time the system is in C10. Overall, I am really happy with this result. As discussed on reddit [18], I am going to write a third part of this series in which I will talk about the power consumption and settings when the cluster is running in production. In the end I spent quite some time trying to figure out which settings give the best results, and it often meant long nights and many reboots. While looking for resources I stumbled again across Wolfgang's YouTube channel, where I found a great video that summarizes the main steps [19]. I can highly recommend the video as well as the channel, especially for beginners in this topic. That being said: I hope you enjoyed my deep dive, and see you in the next post.

Resources

#EOF