Overview of Error Counters

From OpenFabrics Alliance Wiki
Revision as of 10:16, 28 October 2014 by Skcoulter

IB Error Counter Definitions and Examples

LinkDowned

Counts the number of times the link has gone down. Usually associated with a node reboot.

If not associated with a reboot, it could indicate a failing connection (for example, port flapping).

{{{ Example:

ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - LinkDowned:1 / LinkDowned:3 }}}
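The link-report examples on this page appear to follow the shape `endpoint A / endpoint B - counters A / counters B`, with one side of the counters left empty when only one end reports. That reading is an assumption drawn from the examples, not a documented format. Under that assumption, a minimal Python sketch for splitting such a line (the function name `parse_link_report` is hypothetical) could look like this:

```python
import re

# Matches "Name:123" pairs such as "LinkDowned:3".
COUNTER_RE = re.compile(r"(\w+):(\d+)")

def parse_link_report(line):
    """Split 'endA / endB - countersA / countersB' into two (endpoint, counters) pairs.

    Assumes the layout seen in the examples on this page; any syslog prefix
    should be removed before calling.
    """
    ends, sep, counts = line.strip().partition(" - ")
    if not sep:
        raise ValueError("no ' - ' separator found")
    end_a, _, end_b = ends.partition(" / ")
    side_a, _, side_b = counts.partition("/")
    return [
        (end_a.strip(), {k: int(v) for k, v in COUNTER_RE.findall(side_a)}),
        (end_b.strip(), {k: int(v) for k, v in COUNTER_RE.findall(side_b)}),
    ]

report = parse_link_report(
    "ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / "
    "Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - "
    "LinkDowned:1 / LinkDowned:3"
)
for endpoint, counters in report:
    print(endpoint, counters)
```

Splitting on `" - "` (with spaces) rather than a bare hyphen keeps port names such as `Port 1-ext 1` intact.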

Linkspeed not at maximum

The link is not operating at the maximum speed it supports (possible speeds include 2.5 Gbps, 5.0 Gbps, and 10.0 Gbps).

Usually a reseat of cable/card resolves the issue.

{{{ Example:

Switch ibb:Line 2:Port 15-ext 15:Lid 435:GUID 8f104003f6398: Linkspeed (2.5Gbps) not at maximum speed supported (2.5Gbpsor5.0Gpbs) }}}

Linkwidth not at maximum

The link is not operating at the maximum width it supports (possible widths include 4x, 8x, and 12x).

Usually a reseat of cable/card resolves the issue.

{{{ Example:

Switch ibb:Line 7:Port 11-ext 11:Lid 442:GUID 8f104003f63ca: Linkwidth (1X) not at maximum width supported (1XOR4X) }}}

PortRcvErrors

These errors can be due to local physical errors, local buffer overruns, or receiving a malformed packet.

If a malformed packet is received, the problem lies somewhere else on the fabric: some other device is putting bad packets on the wire.

{{{ Jun 21 08:03:03 mu-master ibmon2[23571]: ib19 Port 36: [SymbolErrorCounter == 3] [PortRcvErrors == 3] - (mu0354) }}}

PortRcvRemotePhysicalErrors

Similar to !PortRcvErrors, but the End Bad Packet (EBP) flag is set. Usually a problem between the physical and logical layers.

{{{ Oct 3 12:33:05 mu-master ibmon2[15627]: ib89 Port 7: [PortRcvRemotePhysicalErrors == 1] - (ibcore1 Line 28 Port 18) }}}

PortRcvSwitchRelayErrors

This field counts the number of packets that could not be forwarded by the switch.

The reasons for this include:

1. VL mapping errors (LANL has not implemented VLs yet).

2. Looping: the input port and output port are the same.

3. DLID errors:

  a. The DLID is a multicast DLID (0xC000 to 0xFFFE) that is not configured for this CA.
  b. The DLID is outside the LFTS range or greater than the LinearFDBTop.
  c. The port associated with this DLID in the LFTS file does not exist.

Usually this is due to the poor implementation of multicast on IB and can therefore be ignored.
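The looping and DLID checks listed above can be sketched in Python. This is only an illustration of the reasons given, not how any switch actually implements forwarding: the function name, the dict used as a stand-in for the forwarding table, and the `mcast_groups` set are all hypothetical, and VL mapping errors are left out.

```python
MULTICAST_LID_MIN = 0xC000
MULTICAST_LID_MAX = 0xFFFE

def relay_problem(dlid, in_port, lft, linear_fdb_top, num_ports, mcast_groups=()):
    """Return a reason string if a packet could not be relayed, else None.

    lft is a dict mapping unicast DLID -> output port (illustrative stand-in
    for the forwarding table); num_ports is the highest valid port number.
    """
    if MULTICAST_LID_MIN <= dlid <= MULTICAST_LID_MAX:
        # Reason 3a: multicast DLID that nobody configured.
        if dlid not in mcast_groups:
            return "multicast DLID not configured"
        return None
    # Reason 3b: DLID beyond the table or above LinearFDBTop.
    if dlid > linear_fdb_top or dlid not in lft:
        return "DLID outside forwarding table range"
    out_port = lft[dlid]
    # Reason 3c: table entry points at a port the switch does not have.
    if not 0 <= out_port <= num_ports:
        return "output port does not exist"
    # Reason 2: packet would leave through the port it arrived on.
    if out_port == in_port:
        return "looping: input and output port are the same"
    return None

lft = {5: 3, 6: 255}
print(relay_problem(0xC001, 1, lft, linear_fdb_top=10, num_ports=36))
print(relay_problem(5, 3, lft, linear_fdb_top=10, num_ports=36))
```

Checking the multicast range first mirrors the observation above that unconfigured multicast traffic is the usual, ignorable cause.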


{{{ Example:

Switch iba:Line 9:Port 7-ext 7:Lid 404:GUID 8f104003f633a / Switch iba:Spine 3:Port 17-ext 17:Lid 394:GUID 8f100010b0331 - RcvSwRelayErrors:30 / }}}

Port[Rcv|Xmit]ConstraintErrors

This is the number of packets [ received and discarded on | not transmitted by ] a port in the fabric.

There are two general reasons for this:

1. The filter for raw packets [ inbound | outbound ] is turned on and these are raw packets.

2. The partition key or IP version check has failed.

{{{ Example:

ib47 Port 4: [PortXmitConstraintErrors == 255] [PortRcvConstraintErrors == 255] - (lu0801) }}}

PortXmitWait

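This field counts the number of ticks during which the port had data to transmit but could not send any for the entire tick, either because of insufficient credits or because of lack of arbitration.

It is almost always non-zero.

Very large values indicate congestion. If the congestion becomes severe, you will also see !XmitDiscards.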

{{{ Example:

Switch ib:Line 10:Port 24-ext 24:Lid 20:GUID 2c9020041d620 / Switch ib:Spine 06:Port 10-ext 10:Lid 42:GUID 2c9020041eae8 - PortXmitWait:236225217 / PortXmitWait:100104073 }}}

RcvRemotePhys(ical)Errors

This field counts "Total number of packets marked with the EBP delimiter received on the port."

The idea is that an "End Bad Packet" (EBP) delimiter can be used instead of EGP (End Good Packet) whenever something is known to be wrong with the packet. So, if a packet is passing through the fabric and some port notices a problem (e.g. a bad CRC), it will end the packet with EBP instead of EGP. If forwarding the packet requires store-and-forward, an option would be to just drop it and not waste bandwidth sending EBP packets. The CA that reports this error is NOT where the corruption occurred; it occurred elsewhere in the fabric.

{{{ Example:

Switch ib:Spine 03:Port 19-ext 19:Lid 41:GUID 2c9020041ea98 / Switch ib:Line 19:Port 21-ext 21:Lid 5:GUID 2c9020041d268 - / PortRcvRemotePhysicalErrors:13 }}}

SymbolErrors

Symbols within a packet are interpreted on the HCA/CA. If the translation or interpretation fails, a minor event called a symbol error is recorded. 99% of all !SymbolErrors are hardware related. If the counts are small (small being a relative term that is up for interpretation), they can be ignored. If the counts are large and/or the same CA reports this error regularly, it should be investigated. On a node, reseat the HCA and/or cable; if the reseat does not help, replace the HCA and/or cable. On a switch, reseat or replace the cable.


{{{ Example:

ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - / SymbolErrors:65535 }}}


VL15Drop

VL15 is the virtual lane reserved for subnet management packets. These packets are the first to be dropped when a port runs short of resources, usually because there is not enough space in the buffers. In many instances these errors can be ignored. There have been instances, however, when these messages were very closely correlated with user problems in time and fabric space. If management packets are being dropped, the buffers are being kept very busy with other data, so this counter can also indicate congestion.

{{{ Example:

Switch ib:Line 13:Port 24-ext 24:Lid 15:GUID 2c9020041d470 / Switch ib:Spine 06:Port 13-ext 13:Lid 42:GUID 2c9020041eae8 - / VL15Dropped:17 }}}

XmtDiscards

This counter tracks packets that were discarded instead of transmitted. This usually indicates congestion in the fabric: the CA this packet was supposed to be sent to cannot accept it. After a number of retries, or when there are too many incoming packets, the packet to be transmitted is dropped. If the fabric is being routed well, without deadlocks or credit loops, these should be transient.

{{{ Example:

Oct 26 00:33:52 ce-master ibmon[5694]: ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - XmtDiscards:2 / }}}

LessThanOptLink

The link is not performing at its optimal speed. To correct this, first check what the link connects. If you can reach the switch and enable/disable ports, reset the ports to see whether that corrects the issue. If the problem remains, reseat the cable. If none of these work, a spine/card reseat may be needed.

{{{ Example:

Aug 31 12:25:22 mu-master ibmon2[17708]: Connection: ibcore2 Line 23 Port [11][ext 11] / ib65 Port [21] has less than optimal link - SDR }}}

AWOL Link

This is not a counter, but an error identified by LANL processes. When ibnetdiscover is run, any link that is live but not responding throws an error. These errors are interpreted by ibmon and logged to syslog.

grep -i lid
# Port info: DR path slid 65535; dlid 65535; 0,1,1,20,25,16,34 port 0
Lid:.............................1095
SMLid:...........................250

Using the most recent IB fabric map (/etc/ibmon/data/ibnet_map), use the LID to find this last hop and identify what should be on that last port. This is not always easy, as that port is probably not listed. Sometimes the surrounding entries make it clear what is missing. When they do not, an older fabric map (same directory, ibnet_map.yyyymmdd.hhmm) provides the answer.
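The lookup described above can be sketched in Python. The sketch assumes the fabric map is ibnetdiscover-style text in which each device or port line carries a `lid <number>` token; the sample map lines below are made up for illustration and the function names are hypothetical. How the directed-route path relates to the reported LID is not spelled out here, so `dr_path_hops` only splits the path into its port numbers.

```python
import re

def dr_path_hops(dr_path):
    """Split a directed-route path such as '0,1,1,20,25,16,34' into port numbers."""
    return [int(hop) for hop in dr_path.split(",")]

def lines_for_lid(map_text, lid):
    """Return the map lines that mention 'lid <lid>' as a whole token."""
    pattern = re.compile(r"\blid\s+%d\b" % lid, re.IGNORECASE)
    return [line for line in map_text.splitlines() if pattern.search(line)]

# Made-up sample in the assumed map layout (not real fabric data).
sample_map = (
    'Switch 36 "S-example"  # "example switch" base port 0 lid 1095 lmc 0\n'
    '[1] "H-example"[1]  # "example node" lid 10950 4xQDR\n'
)

print(dr_path_hops("0,1,1,20,25,16,34"))
for line in lines_for_lid(sample_map, 1095):
    print(line)
```

The word-boundary match keeps a search for LID 1095 from also returning LID 10950. To search a real map, read the file contents (for example `open("/etc/ibmon/data/ibnet_map").read()`) and pass them as `map_text`.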


Many devices reporting IB errors

This alert is sent to Zenoss and the HPC Network Oncall when more than 20 devices are showing counters. It can have any number of causes. The most common is a few SymbolErrors on the fabric accompanied by many PortRcvRemotePhysicalErrors across the fabric. If PortRcvRemotePhysicalErrors make up the majority of the errors when this message is received, run the following command on the master node to look for other counters:

grep ibmon2 /var/log/messages | grep -v PortRcvRemotePhysicalErrors

This will show other devices on the fabric with actual errors. Resolve those issues first (even if the counters are not above their thresholds); that will usually clear the rest of the device counters as well.
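When the grep output is long, a tally per counter can help decide where to look first. The Python sketch below assumes the ibmon2 log layout seen in the examples on this page, with counters written as `[Name == value]`; the function name `other_counters` is hypothetical. Unlike `grep -v`, which drops every line that mentions PortRcvRemotePhysicalErrors, this version keeps any other counters that appear on the same line.

```python
import re
from collections import Counter

# Matches "[PortRcvErrors == 3]" style entries.
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")

def other_counters(lines, ignore="PortRcvRemotePhysicalErrors"):
    """Sum the values of every counter except `ignore` across ibmon2 log lines."""
    totals = Counter()
    for line in lines:
        if "ibmon2" not in line:
            continue
        for name, value in COUNTER_RE.findall(line):
            if name != ignore:
                totals[name] += int(value)
    return totals

log_lines = [
    "Jun 21 08:03:03 mu-master ibmon2[23571]: ib19 Port 36: "
    "[SymbolErrorCounter == 3] [PortRcvErrors == 3] - (mu0354)",
    "Oct 3 12:33:05 mu-master ibmon2[15627]: ib89 Port 7: "
    "[PortRcvRemotePhysicalErrors == 1] - (ibcore1 Line 28 Port 18)",
]
for name, total in other_counters(log_lines).most_common():
    print(name, total)
```

To run it against the live log, pass an open file object for `/var/log/messages` as `lines`.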

{{{ Example:

Ibmon2 Alert from mp-master ~ Many devices reporting IB errors }}}