Difference between revisions of "Overview of Error Counters"

From OpenFabrics Alliance Wiki
(Created page with "=== IB Error Counter Definitions and Examples === ==== LinkDowned ==== Just like it says. Usually associated with a node reboot. If not associated with a reboot, could be...")
 
 
(2 intermediate revisions by one user not shown)
Line 8: Line 8:
 
If not associated with a reboot, could be a failing connection.  (like port flapping)
 
  
{{{
 
Example:
 
 
ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - LinkDowned:1 / LinkDowned:3 
 
}}}
 
  
 
==== Linkspeed not at maximum ====
 
Line 19: Line 14:
  
 
Usually a reseat of cable/card resolves the issue.
 
 
{{{
 
Example:
 
 
Switch ibb:Line 2:Port 15-ext 15:Lid 435:GUID 8f104003f6398: Linkspeed (2.5Gbps) not at maximum speed supported (2.5Gbpsor5.0Gpbs)
 
}}}
 
  
 
==== Linkwidth not at maximum ====
 
Line 32: Line 21:
 
Usually a reseat of cable/card resolves the issue.
 
  
{{{
 
Example:
 
 
Switch ibb:Line 7:Port 11-ext 11:Lid 442:GUID 8f104003f63ca: Linkwidth (1X) not at maximum width supported (1XOR4X)
 
}}}
 
  
 
==== PortRcvErrors ====
 
Line 44: Line 28:
 
If a malformed packet is received - this indicates a problem somewhere else on the fabric.  Somebody is putting bad messages on the wire.
 
  
{{{
 
Jun 21 08:03:03 mu-master ibmon2[23571]: ib19  Port 36: [SymbolErrorCounter == 3] [PortRcvErrors == 3] - (mu0354)
 
}}}
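A minimal Python sketch for pulling the counter names and values out of a line like the one above. The function name and the regular expression are illustrative assumptions inferred from this one sample line; they are not part of ibmon2.

```python
import re

# Matches bracketed pairs such as "[PortRcvErrors == 3]".
# The format is inferred from the sample line only (assumption).
BRACKET_COUNTER = re.compile(r"\[(\w+) == (\d+)\]")

def bracket_counters(line):
    """Return {counter name: value} for every "[Name == N]" group in a log line."""
    return {name: int(value) for name, value in BRACKET_COUNTER.findall(line)}

sample = ("Jun 21 08:03:03 mu-master ibmon2[23571]: ib19  Port 36: "
          "[SymbolErrorCounter == 3] [PortRcvErrors == 3] - (mu0354)")
print(bracket_counters(sample))
```

The process id in "ibmon2[23571]" is not picked up because the pattern requires the " == " separator inside the brackets.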
 
  
 
==== PortRcvRemotePhysicalErrors ====
 
Line 53: Line 34:
 
Usually a problem between the physical and logical layers.
 
  
{{{
 
Oct  3 12:33:05 mu-master ibmon2[15627]: ib89  Port 7: [PortRcvRemotePhysicalErrors == 1] - (ibcore1 Line 28 Port 18)
 
}}}
 
  
 
==== PortRcvSwitchRelayErrors ====
 
Line 67: Line 45:
 
2. Looping; input port and output port are the same
 
  
3. DLID errors:
  a. It is a Multicast DLID (0xC000 to 0xFFFE) not configured for this CA.
  b. DLID is outside the LFTS range or greater than the LinearFDBTop.
  c. Port associated with this DLID in the LFTS file does not exist.
  
 
Usually this is due to the poor implementation of multicast on IB and therefore can be ignored.
 
  
 
{{{
 
Example:
 
 
Switch iba:Line 9:Port 7-ext 7:Lid 404:GUID 8f104003f633a / Switch iba:Spine 3:Port 17-ext 17:Lid 394:GUID 8f100010b0331 - RcvSwRelayErrors:30  /
 
}}}
 
  
 
==== Port[Rcv|Xmit]ConstraintErrors ====
 
Line 93: Line 60:
 
2. The partition key or IP version check has failed.
 
  
{{{
 
Example:
 
 
ib47 Port 4: [PortXmitConstraintErrors == 255] [PortRcvConstraintErrors == 255] - (lu0801)
 
}}}
 
  
 
==== PortXmitWait ====
 
Line 107: Line 69:
 
Really large numbers indicate congestion. If the congestion gets really bad, you will see !XmitDiscards.
 
  
{{{
 
Example:
 
 
Switch ib:Line 10:Port 24-ext 24:Lid 20:GUID 2c9020041d620 / Switch ib:Spine 06:Port 10-ext 10:Lid 42:GUID 2c9020041eae8 - PortXmitWait:236225217  / PortXmitWait:100104073
 
}}}
 
  
 
==== RcvRemotePhys(ical)Errors ====
 
Line 119: Line 76:
 
The idea is that an "End Bad Packet" can be used instead of EGP (End Good Packet) whenever you know there is something wrong with the packet.  So, if a packet is passing through the fabric and some port notices a problem (i.e. bad CRC), it will end it with EBP instead of EGP.  If the packet progress requires store-and-forward, an option would be to just drop it and not waste bandwidth sending EBP packets.  The CA that reports this error is NOT where the corruption occurred.  It occurred elsewhere in the fabric.
 
  
{{{
 
Example:
 
 
Switch ib:Spine 03:Port 19-ext 19:Lid 41:GUID 2c9020041ea98 / Switch ib:Line 19:Port 21-ext 21:Lid 5:GUID 2c9020041d268 -  / PortRcvRemotePhysicalErrors:13
 
}}}
 
  
 
==== SymbolErrors ====
 
  
 
The interpretation of symbols within the packet is done on the HCA/CA.  If the translation or interpretation fails, it creates a minor event called a symbol error.  99% of all !SymbolErrors are hardware related.  If the counts are small (small being a relative term that is up for interpretation) they can be ignored.  If the numbers are large and/or the same CA is reporting this error regularly it should be looked into.  On a node, the HCA and/or cable should be reseated.  If the reseat is unsuccessful it should be replaced.  On a switch, reseat the cable or replace the cable.
 
 
 
{{{
 
Example:
 
 
ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - / SymbolErrors:65535
 
}}}
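The 65535 in the example above equals 2^16 - 1, and the 255 in the constraint-error example equals 2^8 - 1. That suggests the counter has pegged at its maximum rather than counted exactly that many errors, so the true count may be higher. A small Python sketch of that check follows; the bit widths in the table are my understanding of the classic PortCounters layout and should be treated as assumptions to verify against the InfiniBand specification, and the function name is hypothetical.

```python
# Bit widths are assumptions (classic PortCounters layout as I understand it);
# verify them against the InfiniBand specification before relying on them.
COUNTER_BITS = {
    "SymbolErrors": 16,
    "PortXmitConstraintErrors": 8,
    "PortRcvConstraintErrors": 8,
}

def is_pegged(name, value):
    """True if value is the largest number a counter of that width can hold."""
    return value >= (1 << COUNTER_BITS[name]) - 1

print(is_pegged("SymbolErrors", 65535))           # 65535 == 2**16 - 1
print(is_pegged("PortRcvConstraintErrors", 255))  # 255 == 2**8 - 1
print(is_pegged("SymbolErrors", 3))
```

A pegged counter says nothing about the error rate until it has been cleared and sampled again.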
 
  
  
Line 141: Line 86:
 
VL15 is the default virtual lane for management packets.  They are the first to be dropped when there are resource limitations on the port.  This is usually related to not enough space in the buffers.  In many instances these errors can be ignored.  There have been instances, however, when these messages were very closely correlated to user problems in time and fabric space.  Obviously, if they are being dropped the buffers are being kept very busy with other data and therefore could indicate congestion.
 
  
{{{
 
Example:
 
 
Switch ib:Line 13:Port 24-ext 24:Lid 15:GUID 2c9020041d470 / Switch ib:Spine 06:Port 13-ext 13:Lid 42:GUID 2c9020041eae8 - / VL15Dropped:17
 
}}}
 
  
 
==== XmtDiscards ====
 
  
This counter tracks packets that were discarded instead of transmitted.  This usually indicates congestion in the fabric.  The CA this packet was supposed to be sent to cannot accept it.  After so many retries and/or too many incoming packets, the packet to be transmitted gets dropped.  If the fabric is being routed well, without deadlocks or credit loops, these should be transient.
+
 
+
{{{
+
Example:
+
 
+
Oct 26 00:33:52 ce-master ibmon[5694]: ceb172a HCA-1:Port 1-ext 1:Lid 267:GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:GUID 8f104003f6398 - XmtDiscards:2  /
+
}}}
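Several examples on this page use the form "endpoint A / endpoint B - counters A / counters B", where one side of the counter list may be empty. The following Python sketch splits such a line into its two halves. It assumes that the counters left of the slash belong to the first endpoint and those right of it to the second, which is inferred from the samples and not confirmed by ibmon; the function name is hypothetical.

```python
import re

COUNTER = re.compile(r"(\w+):(\d+)")

def parse_pair_line(line):
    """Return [(endpoint A, counters A), (endpoint B, counters B)] for one log line."""
    # Drop an optional syslog prefix such as "Oct 26 00:33:52 ce-master ibmon[5694]: ".
    message = line.split("]: ", 1)[-1]
    # The last " - " separates the endpoint pair from the counter pair;
    # "Port 1-ext 1" has no spaces around its dash, so it is not matched.
    endpoints, _, counters = message.rpartition(" - ")
    end_a, _, end_b = endpoints.partition(" / ")
    cnt_a, _, cnt_b = counters.partition("/")
    return [
        (end_a.strip(), {n: int(v) for n, v in COUNTER.findall(cnt_a)}),
        (end_b.strip(), {n: int(v) for n, v in COUNTER.findall(cnt_b)}),
    ]

sample = ("Oct 26 00:33:52 ce-master ibmon[5694]: ceb172a HCA-1:Port 1-ext 1:Lid 267:"
          "GUID 2c90300021a12 / Switch ibb:Line 2:Port 16-ext 16:Lid 435:"
          "GUID 8f104003f6398 - XmtDiscards:2  /")
for endpoint, counters in parse_pair_line(sample):
    print(endpoint, counters)
```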
+
 
+
==== LessThanOptLink ====
+
Link is not performing at its optimal speed.  To correct, look at what the connections are.
+
If you are able to get to the switch and enable/disable ports, reset the ports to see if that corrects the issue.
+
If the problem remains, reseat the cable.  If none of these steps work, a spine/card reseat may be needed.
+
 
+
{{{
+
Example:
+
 
+
Aug 31 12:25:22 mu-master ibmon2[17708]: Connection: ibcore2 Line 23 Port [11][ext 11] / ib65 Port [21] has less than optimal link - SDR
+
}}}
+
 
+
==== AWOL Link ====
+
 
+
This is not a counter, but an error as identified by LANL processes.
+
When ibnetdiscover is run, any link that is ''live'' but not responding throws an error.
+
These errors are interpreted by ibmon and logged to syslog.
+
 
+
{{{
+
Example:
+
 
+
Aug  9 12:33:15 mu-master ibmon2[32424]: AWOL Link: (DR path slid 0; dlid 0; 0,1,1,20,25,16,34 Attr 0x11:0)
+
 
+
[root@mu-master ~]# smpquery portinfo -D 0,1,1,20,25,16,34
+
This should result in an error.  If it does not you will see information like what is in the next example.
+
If this works - it means the node/port was probably coming up but not yet able to respond to a MAD packet.
+
If it fails, remove the last port number and run again, grep'ing for the LID.
+
 
+
[root@mu-master ~]# smpquery pi -D 0,1,1,20,25,16,34 | grep -i lid
+
# Port info: DR path slid 65535; dlid 65535; 0,1,1,20,25,16,34 port 0
+
Lid:.............................1095
+
SMLid:...........................250
+
 
+
Using the most recent IB fabric map (/etc/ibmon/data/ibnet_map), use the LID to find this last hop and identify what should be on that last port.  This is not always easy, as that port is probably not listed.  Sometimes it is easy to discern from the surrounding entries what is missing.  Sometimes it is not, and referring to an older fabric map (same directory, ibnet_map.yyyymmdd.hhmm) provides the answer.
+
}}}
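The "remove the last port number and run again" step described in the block above can be sketched in Python as follows. The helper names are hypothetical; the only external command referenced is the smpquery portinfo -D call already shown above, and the sketch only builds the argument list without running it.

```python
def parent_dr_path(path):
    """Drop the last hop from a comma-separated directed-route path."""
    hops = path.split(",")
    if len(hops) < 2:
        raise ValueError("path has no hop left to remove: %r" % path)
    return ",".join(hops[:-1])

def portinfo_command(path):
    """Argument list for the smpquery call shown above (not executed here)."""
    return ["smpquery", "portinfo", "-D", path]

awol = "0,1,1,20,25,16,34"
print(" ".join(portinfo_command(parent_dr_path(awol))))
```

Querying the parent path reaches the last device that still answers, whose LID can then be looked up in the fabric map.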
+
 
+
==== Many devices reporting IB errors ====
+
This alert is sent out to Zenoss and the HPC Network Oncall when more than 20 devices are showing counters, and it can have any number of causes.  The most common cause is a few SymbolErrors reported on the fabric together with a lot of PortRcvRemotePhysicalErrors showing up across the fabric.  If these PortRcvRemotePhysicalErrors seem to be the majority of the errors when this message is received, run the following command on the master node to look for other counters:
+
 
+
grep ibmon2 /var/log/messages | grep -v PortRcvRemotePhysicalErrors
+
 
+
This will show other devices on the fabric with actual errors.  Resolve those issues first (even if the counters are not above their thresholds); that will usually clear the rest of the device counters as well.
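As a rough Python equivalent of the grep pipeline above, the sketch below keeps the ibmon2 lines that do not mention PortRcvRemotePhysicalErrors and counts them per device. Taking the device name to be the first word after the syslog tag is an assumption based on the sample lines on this page (it would misread an "AWOL Link:" line, for instance), and both function names are hypothetical.

```python
from collections import Counter

def other_counter_lines(lines):
    """Same selection as: grep ibmon2 | grep -v PortRcvRemotePhysicalErrors."""
    return [line for line in lines
            if "ibmon2" in line and "PortRcvRemotePhysicalErrors" not in line]

def lines_per_device(lines):
    """Count the selected lines per device (assumed: first word after "]: ")."""
    counts = Counter()
    for line in other_counter_lines(lines):
        words = line.split("]: ", 1)[-1].split()
        if words:
            counts[words[0]] += 1
    return counts

log = [
    "Jun 21 08:03:03 mu-master ibmon2[23571]: ib19  Port 36: "
    "[SymbolErrorCounter == 3] [PortRcvErrors == 3] - (mu0354)",
    "Oct  3 12:33:05 mu-master ibmon2[15627]: ib89  Port 7: "
    "[PortRcvRemotePhysicalErrors == 1] - (ibcore1 Line 28 Port 18)",
]
print(lines_per_device(log))
```

Like the grep, this drops a line entirely if it mentions PortRcvRemotePhysicalErrors, even when the same line also carries another counter.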
+
 
+
{{{
+
Example:
+
 
+
Ibmon2 Alert from mp-master ~ Many devices reporting IB errors
+
}}}
+

Latest revision as of 10:23, 28 October 2014

IB Error Counter Definitions and Examples

LinkDowned

The link went down, just as the name says. This is usually associated with a node reboot.

If it is not associated with a reboot, it could be a failing connection (for example, port flapping).


Linkspeed not at maximum

The link is not operating at full speed (e.g. 2.5 Gbps, 5.0 Gbps, or 10.0 Gbps).

Usually a reseat of cable/card resolves the issue.

Linkwidth not at maximum

The link is not operating at full width (e.g. 4x, 8x, or 12x).

Usually a reseat of cable/card resolves the issue.


PortRcvErrors

These errors can be due to local physical errors, local buffer overruns, or receiving a malformed packet.

If a malformed packet is received, this indicates a problem somewhere else on the fabric. Somebody is putting bad messages on the wire.


PortRcvRemotePhysicalErrors

Similar to PortRcvErrors, except that the packet arrived with the End Bad Packet (EBP) flag set. Usually a problem between the physical and logical layers.


PortRcvSwitchRelayErrors

This field counts the number of packets that could not be forwarded by the switch.

The reasons for this include

1. VL mapping errors (LANL has not implemented VLs yet).

2. Looping; the input port and output port are the same.

3. DLID errors: the DLID is a Multicast DLID (0xC000 to 0xFFFE) not configured for this CA, or the DLID is outside the LFTS range or greater than the LinearFDBTop, or the port associated with this DLID in the LFTS file does not exist.

Usually this is due to the poor implementation of multicast on IB and therefore can be ignored.
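The three DLID cases listed above can be expressed as a small check. This is only a sketch under simplifying assumptions: the forwarding table is modelled as a plain Python dict from LID to output port, the switch's ports as a set of port numbers, and the configured multicast groups as a set of LIDs. None of these names come from a real tool or library.

```python
MCAST_LOW, MCAST_HIGH = 0xC000, 0xFFFE  # multicast DLID range given in the text

def dlid_problem(dlid, linear_fdb_top, lft, ports, mcast_groups=frozenset()):
    """Return a short reason why the DLID cannot be relayed, or None if it looks fine.

    lft: dict mapping unicast LID -> output port (hypothetical model of the table).
    ports: set of port numbers that exist on the switch.
    mcast_groups: set of multicast LIDs that are configured.
    """
    if MCAST_LOW <= dlid <= MCAST_HIGH:
        if dlid in mcast_groups:
            return None
        return "multicast DLID not configured"
    if dlid > linear_fdb_top or dlid not in lft:
        return "DLID outside the forwarding table range"
    if lft[dlid] not in ports:
        return "port for this DLID does not exist"
    return None

table = {10: 1, 11: 99}
print(dlid_problem(0xC001, 500, table, {1, 2}))
print(dlid_problem(600, 500, table, {1, 2}))
print(dlid_problem(11, 500, table, {1, 2}))
print(dlid_problem(10, 500, table, {1, 2}))
```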


Port[Rcv|Xmit]ConstraintErrors

This is the number of packets [ received and discarded on | not transmitted by ] a port in the fabric.

There are 2 general reasons for this.

1. The filter for raw packets [ inbound | outbound ] is turned on and these are raw packets.

2. The partition key or IP version check has failed.


PortXmitWait

This field counts the number of ticks during which the port had data to transmit but could not send it, for example because it lacked flow-control credits.

It is almost always non-zero.

Really large numbers indicate congestion. If the congestion gets really bad, you will see XmitDiscards.


RcvRemotePhys(ical)Errors

This field counts "Total number of packets marked with the EBP delimiter received on the port."

The idea is that an "End Bad Packet" (EBP) can be used instead of EGP (End Good Packet) whenever you know there is something wrong with the packet. So, if a packet is passing through the fabric and some port notices a problem (e.g. a bad CRC), it will end it with EBP instead of EGP. If the packet progress requires store-and-forward, an option would be to just drop it and not waste bandwidth sending EBP packets. The CA that reports this error is NOT where the corruption occurred. It occurred elsewhere in the fabric.


SymbolErrors

The interpretation of symbols within the packet is done on the HCA/CA. If the translation or interpretation fails, it creates a minor event called a symbol error. 99% of all SymbolErrors are hardware related. If the counts are small (small being a relative term that is up for interpretation) they can be ignored. If the numbers are large and/or the same CA is reporting this error regularly, it should be looked into. On a node, reseat the HCA and/or cable. If the reseat does not help, replace the part. On a switch, reseat or replace the cable.


VL15Drop

VL15 is the default virtual lane for management packets. They are the first to be dropped when there are resource limitations on the port. This is usually related to not enough space in the buffers. In many instances these errors can be ignored. There have been instances, however, when these messages were very closely correlated to user problems in time and fabric space. Obviously, if they are being dropped the buffers are being kept very busy with other data and therefore could indicate congestion.


XmtDiscards

This counter tracks packets that were discarded instead of transmitted. This usually indicates congestion in the fabric. The CA this packet was supposed to be sent to cannot accept it. After so many retries and/or too many incoming packets, the packet to be transmitted gets dropped. If the fabric is being routed well, without deadlocks or credit loops, these should be transient.