High fan speed in lake1 after powering it back #39

Closed
opened 2023-09-14 11:56:19 +02:00 by rarias · 10 comments
rarias commented 2023-09-14 11:56:19 +02:00 (Migrated from pm.bsc.es)

After removing the burnt voltage regulator and replacing it with the new one (see #22), the node boots and seems to be stable, but it is drawing more power than the other nodes (240 W rather than around 120 W) and the fans spin at 20000 RPM (the maximum). The temperature seems fine according to all the sensors on the board, so the extra power consumption could simply be due to the high fan speed.

I will continue to monitor the node closely in case of any temperature or power surge.

I'm trying to find a way to bring the fans back to normal speed, but without success so far.
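A minimal sketch of how the power comparison between nodes could be automated. It parses the `Input Power` rows of `ipmitool sdr` output; the two samples are the readings quoted in this thread from the `oss01` and `hut` prompts. The function names and the 1.5x ratio are assumptions chosen for illustration, not anything used in the issue.

```python
import re

def input_power_watts(sdr_text):
    """Return {sensor_name: watts} for the 'Input Power' rows of `ipmitool sdr` output."""
    readings = {}
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or "Input Power" not in fields[0]:
            continue
        match = re.fullmatch(r"(\d+) Watts", fields[1])
        if match:  # rows such as "no reading" are skipped
            readings[fields[0]] = int(match.group(1))
    return readings

def is_power_suspicious(node_watts, reference_watts, ratio=1.5):
    """Flag a node drawing more than `ratio` times the reference (ratio is an assumption)."""
    return node_watts > ratio * reference_watts

# Readings copied from the outputs quoted in this thread.
OSS01 = """\
PS1 Input Power  | 232 Watts         | ok
PS2 Input Power  | no reading        | ns
"""
HUT = """\
PS1 Input Power  | 112 Watts         | ok
"""

oss01_watts = sum(input_power_watts(OSS01).values())
hut_watts = sum(input_power_watts(HUT).values())
print(oss01_watts, hut_watts, is_power_suspicious(oss01_watts, hut_watts))
```

In practice the text would come from running `ipmitool sdr` on each node instead of from the embedded samples.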

rarias commented 2023-09-14 11:57:55 +02:00 (Migrated from pm.bsc.es)

changed the description

rarias commented 2023-09-14 12:00:32 +02:00 (Migrated from pm.bsc.es)

mentioned in issue #22

rarias commented 2023-09-14 12:19:06 +02:00 (Migrated from pm.bsc.es)

From https://www.intel.com/content/dam/support/us/en/documents/motherboards/server/sb/updating_frusdr_on_epsd_server.pdf :

2.1 Updating FRUSDR

  • Are your servers too noisy?
  • Do the system fans run at higher-than-normal speeds without any reason?
  • Is the system status LED not glowing steady green on the front panel of your server?

If your answer is yes, then you may have an “unhealthy” server. One or more components might
not be configured properly in your server or they might need replacement. Identify faulty
components if any and replace them. On reboot, if your server is still noisy, then you need a
FRUSDR update. The system might have been updated with FRUSDR for the wrong
configuration. Making the required changes will reduce the energy consumption and improve
acoustics which will extend the life of server.

rarias commented 2023-09-19 18:08:04 +02:00 (Migrated from pm.bsc.es)
More possible causes: https://www.intel.com/content/www/us/en/support/articles/000036464/server-products/server-boards.html
rarias commented 2023-09-21 17:22:04 +02:00 (Migrated from pm.bsc.es)

From the photos, it seems that the power supply was originally in the second slot, but is now in the first one:

```
oss01:~ # ipmitool sdr list fru
Baseboard        | Log FRU @00h 07.1 | ok
Pwr Supply 1 FRU | Log FRU @02h 0a.1 | ok
Pwr Supply 2 FRU | Log FRU @03h 0a.2 | ok
Front Panel      | Log FRU @04h 0c.1 | ok
HS Backplane 1   | Log FRU @05h 0f.1 | ok
PCIe SSD AIC 1   | Log FRU @12h 0b.1 | ok

hut% sudo ipmitool sdr list fru
Baseboard        | Log FRU @00h 07.1 | ok
Pwr Supply 1 FRU | Log FRU @02h 0a.1 | ok
Front Panel      | Log FRU @04h 0c.1 | ok
HS Backplane 1   | Log FRU @05h 0f.1 | ok
```

Removing the two DIMM modules donated to eudy may also have had an effect.
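The difference between the two FRU listings can be computed mechanically. The sketch below, with a hypothetical helper `fru_names`, takes the two `ipmitool sdr list fru` outputs quoted in this comment and reports the records present on `oss01` but not on `hut`.

```python
def fru_names(sdr_fru_text):
    """Return the set of FRU record names (first column) from `ipmitool sdr list fru` output."""
    names = set()
    for line in sdr_fru_text.splitlines():
        if "|" in line:
            names.add(line.split("|")[0].strip())
    return names

# Outputs copied from this comment.
OSS01 = """\
Baseboard        | Log FRU @00h 07.1 | ok
Pwr Supply 1 FRU | Log FRU @02h 0a.1 | ok
Pwr Supply 2 FRU | Log FRU @03h 0a.2 | ok
Front Panel      | Log FRU @04h 0c.1 | ok
HS Backplane 1   | Log FRU @05h 0f.1 | ok
PCIe SSD AIC 1   | Log FRU @12h 0b.1 | ok
"""
HUT = """\
Baseboard        | Log FRU @00h 07.1 | ok
Pwr Supply 1 FRU | Log FRU @02h 0a.1 | ok
Front Panel      | Log FRU @04h 0c.1 | ok
HS Backplane 1   | Log FRU @05h 0f.1 | ok
"""

only_on_oss01 = sorted(fru_names(OSS01) - fru_names(HUT))
print(only_on_oss01)  # ['PCIe SSD AIC 1', 'Pwr Supply 2 FRU']
```

The extra `Pwr Supply 2 FRU` record is the one relevant to this issue.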

rarias commented 2023-09-21 17:24:17 +02:00 (Migrated from pm.bsc.es)

The PS2 doesn't show in hut:

```
oss01:~ # ipmitool sdr | grep PS
PS1 Status       | 0x00              | ok
PS2 Status       | 0x00              | ok
PS1 Input Power  | 232 Watts         | ok
PS2 Input Power  | no reading        | ns
PS1 Curr Out %   | 27 percent        | ok
PS2 Curr Out %   | no reading        | ns
PS1 Temperature  | 26 degrees C      | ok
PS2 Temperature  | no reading        | ns
PS1 Fan Fail     | 0x00              | ok
PS2 Fan Fail     | Not Readable      | ns

hut% sudo ipmitool sdr | grep PS
PS1 Status       | 0x00              | ok
PS1 Input Power  | 112 Watts         | ok
PS1 Curr Out %   | 12 percent        | ok
PS1 Temperature  | 25 degrees C      | ok
PS1 Fan Fail     | 0x00              | ok
```
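One way to spot sensor records that the BMC lists but cannot read is to filter on the status column. This is only a sketch: `unreadable_sensors` is a hypothetical helper, and it assumes the three-column `name | reading | status` layout shown in the output above.

```python
def unreadable_sensors(sdr_text):
    """Return the names of sensors whose status column is 'ns' in `ipmitool sdr` output."""
    names = []
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and fields[2] == "ns":
            names.append(fields[0])
    return names

# Output copied from the oss01 prompt in this comment.
OSS01 = """\
PS1 Status       | 0x00              | ok
PS2 Status       | 0x00              | ok
PS1 Input Power  | 232 Watts         | ok
PS2 Input Power  | no reading        | ns
PS1 Curr Out %   | 27 percent        | ok
PS2 Curr Out %   | no reading        | ns
PS1 Temperature  | 26 degrees C      | ok
PS2 Temperature  | no reading        | ns
PS1 Fan Fail     | 0x00              | ok
PS2 Fan Fail     | Not Readable      | ns
"""

print(unreadable_sensors(OSS01))
# ['PS2 Input Power', 'PS2 Curr Out %', 'PS2 Temperature', 'PS2 Fan Fail']
```

All unreadable records belong to PS2, which matches the empty second power supply slot.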
rarias commented 2023-09-26 16:23:53 +02:00 (Migrated from pm.bsc.es)

I did the following tests:

  1. I removed the power supply from the PS2 socket and installed it in the PS1 socket. Then I turned the node on, but the fans were still at 20000 rpm. Also, the sdr info shows the PS1 Fan Fail sensor as "ns".
  2. I left it in PS1 and added another power supply, taken from owl1, into PS2. Then, when booting, I saw the fans running at the correct speed of 7000 rpm.

As I suspected, the BMC has recorded two power supplies (in the PS1 and PS2 sockets) but now only detects one. To change this information I need to perform an FRU update.

rarias commented 2023-09-26 16:32:08 +02:00 (Migrated from pm.bsc.es)

I moved the PS2 back to owl1 and then I swapped the PS1 back to the PS2 socket.

As a side effect, owl1 now can properly read the power consumption 🤷

rarias commented 2023-09-26 16:55:11 +02:00 (Migrated from pm.bsc.es)

In the lake1 BMC web interface, on the Configuration > SDR Configuration page, I set "Enable SDR Auto-configuration" to "Enabled", clicked "Save" and then "Parse". This made the BMC re-scan the hardware, and it now detects only PS1.

Now the fans are running at around 2500 rpm and the power consumption has dropped to around 100 W. Here is the info from ipmitool:

```
hut% ipmitool -I lanplus -H oss01-ipmi0 -P "" -U "" sdr list | grep -i fan
Fan Redundancy   | 0x00              | ok
System Fan 1A    | 2408 RPM          | ok
System Fan 1B    | 2490 RPM          | ok
System Fan 2A    | 2408 RPM          | ok
System Fan 2B    | 2490 RPM          | ok
System Fan 3A    | 2580 RPM          | ok
System Fan 3B    | 2407 RPM          | ok
System Fan 4A    | 2408 RPM          | ok
System Fan 4B    | 2490 RPM          | ok
System Fan 5A    | 2494 RPM          | ok
System Fan 5B    | 2490 RPM          | ok
System Fan 6A    | 2494 RPM          | ok
System Fan 6B    | 2407 RPM          | ok
PS1 Fan Fail     | 0x00              | ok

hut% ipmitool -I lanplus -H oss01-ipmi0 -P "" -U "" sdr list | grep PS
PS1 Status       | 0x00              | ok
PS1 Input Power  | 112 Watts         | ok
PS1 Curr Out %   | 12 percent        | ok
PS1 Temperature  | 28 degrees C      | ok
PS1 Fan Fail     | 0x00              | ok
```

The PS2 info is now gone.

I will reboot the node and check that this still holds when it boots. If so, this issue can be considered solved.

rarias commented 2023-09-26 17:08:42 +02:00 (Migrated from pm.bsc.es)

The fans went up to around 8000 rpm as the node airflow increased, but they remain at a reasonable speed:

```
hut% ipmitool -I lanplus -H oss01-ipmi0 -P "" -U "" sdr list | grep -i fan
Fan Redundancy   | 0x00              | ok
System Fan 1A    | 7912 RPM          | ok
System Fan 1B    | 7968 RPM          | ok
System Fan 2A    | 8170 RPM          | ok
System Fan 2B    | 7885 RPM          | ok
System Fan 3A    | 8084 RPM          | ok
System Fan 3B    | 7802 RPM          | ok
System Fan 4A    | 8084 RPM          | ok
System Fan 4B    | 7968 RPM          | ok
System Fan 5A    | 7998 RPM          | ok
System Fan 5B    | 7968 RPM          | ok
System Fan 6A    | 8170 RPM          | ok
System Fan 6B    | 7885 RPM          | ok
PS1 Fan Fail     | 0x00              | ok

hut% ipmitool -I lanplus -H oss01-ipmi0 -P "" -U "" sdr list | grep PS
PS1 Status       | 0x00              | ok
PS1 Input Power  | 120 Watts         | ok
PS1 Curr Out %   | 14 percent        | ok
PS1 Temperature  | 28 degrees C      | ok
PS1 Fan Fail     | 0x00              | ok
```

Fixed.
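For future checks, the fan readings could be summarised automatically instead of read by eye. The sketch below parses the fan rows quoted in this comment and compares them with the 20000 RPM maximum mentioned in the issue description. The helper name `fan_rpms` is hypothetical, and what counts as a reasonable speed is left to the reader.

```python
import re

def fan_rpms(sdr_text):
    """Return {fan_name: rpm} for rows whose reading has the form '<n> RPM'."""
    rpms = {}
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3:
            continue
        match = re.fullmatch(r"(\d+) RPM", fields[1])
        if match:  # skips discrete sensors such as 'Fan Redundancy'
            rpms[fields[0]] = int(match.group(1))
    return rpms

# Output copied from this comment.
AFTER_REBOOT = """\
Fan Redundancy   | 0x00              | ok
System Fan 1A    | 7912 RPM          | ok
System Fan 1B    | 7968 RPM          | ok
System Fan 2A    | 8170 RPM          | ok
System Fan 2B    | 7885 RPM          | ok
System Fan 3A    | 8084 RPM          | ok
System Fan 3B    | 7802 RPM          | ok
System Fan 4A    | 8084 RPM          | ok
System Fan 4B    | 7968 RPM          | ok
System Fan 5A    | 7998 RPM          | ok
System Fan 5B    | 7968 RPM          | ok
System Fan 6A    | 8170 RPM          | ok
System Fan 6B    | 7885 RPM          | ok
PS1 Fan Fail     | 0x00              | ok
"""
MAX_RPM = 20000  # maximum fan speed mentioned in the issue description

rpms = fan_rpms(AFTER_REBOOT)
slowest, fastest = min(rpms.values()), max(rpms.values())
all_below_max = all(rpm < MAX_RPM for rpm in rpms.values())
print(len(rpms), slowest, fastest, all_below_max)  # 12 7802 8170 True
```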


Reference: rarias/jungle#39