I want to thank vExpert and Canada’s biggest HPE enthusiast and top expert, Stephen Wagner, for contributing to this troubleshooting guide.
If your server randomly shuts down or restarts and you check the iLO logs, you may see something like this:
05:48 1 Server power removed.
05:49 1 Server reset.
05:49 1 Embedded Flash/SD-Card: Restarted,
05:49 1 Server reset.
05:49 2 Server power restored
or
07:00 1 Server power restored.
07:01 1 Server power removed.
07:02 1 Server Reset.
07:02 1 Server power restored.
07:03 1 Server Power removed.
07:04 1 Server reset.
or
04:23:07 Server power restored
04:23:07 Server reset.
04:22:06 Server power removed.
04:22:58 Embedded Flash. Restarted.
04:21:56 Server reset.
Or something similar. In each case you see the same pattern: power removed, server reset, power restored.
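To spot this pattern quickly in a longer log, a small script can count the relevant events. This is only a rough sketch: it assumes the log has been pasted as plain text lines that look like the excerpts above, and the names `EVENT_PATTERNS`, `classify`, `summarize`, and `looks_like_power_cycling` are hypothetical helpers, not part of any HPE tool. Real iLO exports may be formatted differently.

```python
import re

# Assumption: event wording matches the excerpts above. The patterns are
# deliberately loose so that case differences and small typos still match.
EVENT_PATTERNS = {
    "power_removed": re.compile(r"power removed", re.IGNORECASE),
    "power_restored": re.compile(r"power restored", re.IGNORECASE),
    "reset": re.compile(r"server reset", re.IGNORECASE),
    "flash_restart": re.compile(r"embedded flash.*restarted", re.IGNORECASE),
}


def classify(line):
    """Return the event name for one log line, or None if nothing matches."""
    for name, pattern in EVENT_PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how often each known event appears in the given log lines."""
    counts = {name: 0 for name in EVENT_PATTERNS}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts


def looks_like_power_cycling(counts):
    """Flag logs that contain at least one power removal and one reset."""
    return counts["power_removed"] >= 1 and counts["reset"] >= 1
```

Calling `summarize` on the pasted lines and passing the result to `looks_like_power_cycling` gives a quick yes/no on whether the log shows the pattern. It does not say anything about the cause.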
You suspect a hardware failure: a PSU or PDU failure, a loose or unplugged cable, a BIOS crash, a power fluctuation, and so on. But you know your data center did not have any power issues, the UPS logs look fine, and all the other servers in the same rack are fine.
To be honest, the only way to troubleshoot this is:
- Confirm the server is able to run on one PSU.
- If it can run on one PSU for 20+ minutes under high load, then it’s most likely a faulty mainboard or buggy firmware.
- If it cannot run on one PSU, then the load is too high for one PSU, and you most likely also have a faulty PSU or mainboard.
- For extra information, check the iLO and IML logs to see when the last firmware updates were completed, and try to talk to the person who did them to find out whether they were trying to solve a specific problem.
- Check the release notes for both ROM firmware versions (old and new) to see if there are known issues with power/resets.
- Check the release notes for the iLO version to see if there are known issues.
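The single-PSU test in the steps above boils down to a small decision rule. The sketch below only restates that rule in code; the function name `diagnose` and its parameters are hypothetical, and the 20-minute threshold is the one given in the list. It is a checklist aid, not a definitive diagnosis.

```python
def diagnose(ran_on_single_psu, minutes_under_load):
    """Map the outcome of the single-PSU test to the likely suspects.

    ran_on_single_psu: True if the server stayed up with one PSU connected.
    minutes_under_load: how long the test ran under high load.
    """
    if not ran_on_single_psu:
        # The server could not stay up on one PSU.
        return "load too high for one PSU; suspect PSU or mainboard"
    if minutes_under_load >= 20:
        # Power delivery held up, so look beyond the PSUs.
        return "suspect mainboard or firmware"
    # Assumption: a shorter run is treated as not conclusive yet.
    return "inconclusive; rerun for 20+ minutes under high load"
```

For example, `diagnose(True, 30)` points at the mainboard or firmware, which is the cue to go through the firmware and iLO release notes next.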