Overview of Error Counters

The idea is that an EBP (End Bad Packet) can be used instead of an EGP (End Good Packet) whenever a port knows there is something wrong with the packet. If a packet is passing through the fabric and a port notices a problem (e.g. a bad CRC), that port ends the packet with EBP instead of EGP. Each packet received with EBP is counted in PortRcvRemotePhysicalErrors on the receiving port. If the packet is being handled store-and-forward, another option is to drop it rather than waste bandwidth forwarding a packet marked EBP. The CA that reports this error is NOT where the corruption occurred; the corruption happened upstream, elsewhere in the fabric.

{{{ Example:

Switch ib:Spine 03:Port 19-ext 19:Lid 41:GUID 2c9020041ea98 / Switch ib:Line 19:Port 21-ext 21:Lid 5:GUID 2c9020041d268 - / PortRcvRemotePhysicalErrors:13 }}}
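The example line above can be split into its two link ends and its counter values with a few lines of Python. This is only a sketch: it assumes the layout seen in the examples on this page (link ends separated by " / ", then " - ", then one or more Name:value tokens), and the names parse_report and CounterReport are made up for illustration, not part of ibmon.

```python
import re
from typing import NamedTuple


class CounterReport(NamedTuple):
    endpoints: tuple  # the link ends, exactly as written in the log line
    counters: dict    # counter name -> integer value


# A counter token looks like "PortRcvRemotePhysicalErrors:13".
COUNTER_RE = re.compile(r"\b([A-Za-z][A-Za-z0-9]*):(\d+)\b")


def parse_report(line: str) -> CounterReport:
    """Split one counter report into link ends and counters (assumed layout)."""
    # Text before the last " - " describes the link; text after it holds counters.
    link_part, sep, counter_part = line.rpartition(" - ")
    if not sep:
        raise ValueError("no ' - ' separator found in line")
    endpoints = tuple(part.strip() for part in link_part.split(" / "))
    counters = {name: int(value) for name, value in COUNTER_RE.findall(counter_part)}
    return CounterReport(endpoints, counters)
```

If the line still carries its syslog prefix, the prefix ends up in the first endpoint string, so strip it beforehand if that matters.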

SymbolErrors

Symbols within a packet are interpreted on the HCA/CA. If the translation or interpretation fails, a minor event called a symbol error is recorded. 99% of all SymbolErrors are hardware related. If the counts are small ("small" being relative and open to interpretation), they can be ignored. If the counts are large, or the same CA reports this error regularly, the link should be looked into. On a node, reseat the HCA and/or cable; if reseating does not clear the errors, replace the component. On a switch, reseat or replace the cable.

{{{ Example:

ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - / SymbolErrors:65535 }}}
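The triage rule above (ignore small counts, look into large or recurring ones) could be sketched as follows. The thresholds LARGE_COUNT and REPEAT_REPORTS are hypothetical placeholders to tune per site, and symbol_error_suspects is an illustrative name. The one fixed value is 65535, as in the example: the classic PortCounters SymbolErrorCounter is 16 bits wide, so that value means the counter has pegged and the true count is unknown.

```python
from collections import Counter

SATURATED = 0xFFFF   # 16-bit counter ceiling (65535)
LARGE_COUNT = 100    # hypothetical site threshold for a "large" count
REPEAT_REPORTS = 3   # hypothetical threshold for "reporting regularly"


def symbol_error_suspects(reports):
    """Return {device: reason} for devices worth looking into.

    reports: iterable of (device, count) pairs, one per logged SymbolErrors entry.
    """
    times_seen = Counter()
    worst = {}
    for device, count in reports:
        times_seen[device] += 1
        worst[device] = max(worst.get(device, 0), count)

    suspects = {}
    for device, peak in worst.items():
        if peak >= SATURATED:
            suspects[device] = "counter saturated"
        elif peak >= LARGE_COUNT:
            suspects[device] = "large count"
        elif times_seen[device] >= REPEAT_REPORTS:
            suspects[device] = "reported repeatedly"
    return suspects
```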

VL15Drop

VL15 is the default virtual lane for management packets. Management packets are the first to be dropped when a port runs short of resources, usually because there is not enough space in the buffers. In many instances these errors can be ignored. There have been instances, however, when these messages correlated very closely with user problems, both in time and in location on the fabric. If management packets are being dropped, the buffers are being kept very busy with other data, so these drops can indicate congestion.

{{{ Example:

Switch ib:Line 13:Port 24-ext 24:Lid 15:GUID 2c9020041d470 / Switch ib:Spine 06:Port 13-ext 13:Lid 42:GUID 2c9020041eae8 - / VL15Dropped:17 }}}

XmtDiscards

This counter tracks packets that were discarded instead of transmitted. This usually indicates congestion in the fabric: the CA the packet was supposed to be sent to cannot accept it. After too many retries and/or too many incoming packets, the packet waiting to be transmitted is dropped. If the fabric is routed well, without deadlocks or credit loops, these discards should be transient.

{{{ Example:

Oct 26 00:33:52 ce-master ibmon[5694]: ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - XmtDiscards:2 / }}}

LessThanOptLink

The link is not running at its optimal speed. To correct this, first look at what is connected at each end. If you can get to the switch and enable/disable ports, reset the ports to see whether that corrects the issue. If the problem remains, reseat the cable. If none of these steps work, a spine/card reseat may be needed.

{{{ Example:

Aug 31 12:25:22 mu-master ibmon2[17708]: Connection: ibcore2 Line 23 Port [11][ext 11] / ib65 Port [21] has less than optimal link - SDR }}}
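To know which two ports to reset or reseat, the message above can be pulled apart as in this sketch. It assumes the exact wording shown in the example; parse_slow_link and the field names near, far and state are illustrative only. The per-lane signaling rates for SDR, DDR and QDR are the standard 2.5, 5 and 10 Gb/s; any other trailing value is passed through without a rate.

```python
import re

# Signaling rate per lane, in Gb/s, for the classic InfiniBand speeds.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

# Assumed message layout, taken from the single example on this page.
LINK_RE = re.compile(
    r"Connection: (?P<near>.+?) / (?P<far>.+?) "
    r"has less than optimal link - (?P<state>\S+)"
)


def parse_slow_link(line):
    """Return the two link ends and the reported state, or None if no match."""
    match = LINK_RE.search(line)
    if match is None:
        return None
    state = match.group("state")
    return {
        "near": match.group("near"),
        "far": match.group("far"),
        "state": state,
        "lane_gbps": LANE_GBPS.get(state),  # None when the state is not a known speed
    }
```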

AWOL Link

This is not a counter but an error identified by LANL processes. When ibnetdiscover is run, any link that is live but not responding throws an error. These errors are interpreted by ibmon and logged to syslog.

{{{
grep -i lid

Port info: DR path slid 65535; dlid 65535; 0,1,1,20,25,16,34 port 0

Lid:.............................1095
SMLid:...........................250
}}}

Using the most recent IB fabric map (/etc/ibmon/data/ibnet_map), use the lid to find this last hop and identify what should be on that last port. This is not always easy, as that port is probably not listed. Sometimes the surrounding entries make it clear what is missing. When they do not, an older fabric map (same directory, ibnet_map.yyyymmdd.hhmm) provides the answer.
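Finding the last hop starts with the directed-route path in the "Port info" line. The sketch below extracts that path and splits off the final entry. It assumes the usual directed-route convention, where each number is the exit port taken at that hop, so the last number is the port on the last responding device that leads to the missing one. The function names are illustrative.

```python
import re

# Matches the comma-separated hop list in a line such as
# "Port info: DR path slid 65535; dlid 65535; 0,1,1,20,25,16,34 port 0"
DR_PATH_RE = re.compile(r"DR path\b.*?;\s*(\d+(?:,\d+)*)\s+port\s+\d+")


def parse_dr_path(line):
    """Return the directed-route path as a list of integers."""
    match = DR_PATH_RE.search(line)
    if match is None:
        raise ValueError("no directed-route path found in line")
    return [int(hop) for hop in match.group(1).split(",")]


def last_hop(path):
    """Split a path into (route to the last responding device, exit port)."""
    if len(path) < 2:
        raise ValueError("path has no hop beyond the local port")
    return path[:-1], path[-1]
```

With the example above, the exit port to look up in the fabric map is the final entry of the path, 34.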

Many devices reporting IB errors

This alert is sent to Zenoss and the HPC Network Oncall when more than 20 devices are showing counters, and it can have any number of causes. The most common cause is a few SymbolErrors being reported on the fabric while many PortRcvRemotePhysicalErrors show up across the fabric. If PortRcvRemotePhysicalErrors make up the majority of the errors when this message is received, run the following command on the master node to look for other counters:

grep ibmon2 /var/log/messages | grep -v PortRcvRemotePhysicalErrors

This will show the other devices on the fabric with actual errors. Resolve those issues first (even if the counters are not above their thresholds); that will usually clear the rest of the device counters as well.

{{{ Example:

Ibmon2 Alert from mp-master ~ Many devices reporting IB errors }}}
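When there are many lines left after filtering, a per-counter tally helps decide where to start. The sketch below mirrors the grep pipeline above in Python and counts how many remaining lines mention each counter. It assumes counters appear as Name:value tokens after a " - " separator, as in the examples on this page; primary_error_lines and counter_summary are illustrative names, not part of ibmon2.

```python
import re
from collections import Counter

COUNTER_RE = re.compile(r"\b([A-Za-z][A-Za-z0-9]*):(\d+)\b")
DOWNSTREAM_ONLY = "PortRcvRemotePhysicalErrors"


def primary_error_lines(log_lines, tag="ibmon2"):
    """Same filter as: grep ibmon2 ... | grep -v PortRcvRemotePhysicalErrors"""
    return [line for line in log_lines
            if tag in line and DOWNSTREAM_ONLY not in line]


def counter_summary(lines):
    """Count how many lines mention each counter name."""
    totals = Counter()
    for line in lines:
        # Only look after the last " - " so syslog timestamps are never matched.
        _, sep, counter_part = line.rpartition(" - ")
        if not sep:
            continue
        for name, _value in COUNTER_RE.findall(counter_part):
            totals[name] += 1
    return totals
```

Typical use would be to read /var/log/messages into a list of lines, pass it through primary_error_lines, and print counter_summary(...).most_common().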

