Overview of Error Counters
Latest revision as of 10:23, 28 October 2014
IB Error Counter Definitions and Examples
LinkDowned
The link went down, just as the name says. This is usually associated with a node reboot.
If it is not associated with a reboot, it could indicate a failing connection (for example, a flapping port).
Linkspeed not at maximum
The link is not operating at full speed (e.g. 2.5 Gbps, 5.0 Gbps, 10.0 Gbps).
Reseating the cable or card usually resolves the issue.
Linkwidth not at maximum
The link is not operating at full width (e.g. 4x, 8x, 12x).
Reseating the cable or card usually resolves the issue.
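The two checks above amount to comparing a port's active width and speed against the best it supports. The following is a minimal sketch of that comparison, not a tool used on this fabric: the `PortLink` record, its field names, and `link_problems` are hypothetical, and the values are assumed to have been read from the port by some other means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortLink:
    """Hypothetical record of one port's link state (not a real library type)."""
    active_width: int     # lanes currently in use, e.g. 4 for 4x
    max_width: int        # widest width both ends are assumed to support
    active_speed: float   # current speed in Gbps, e.g. 2.5, 5.0, 10.0
    max_speed: float      # fastest speed both ends are assumed to support


def link_problems(link: PortLink) -> List[str]:
    """Return a message for each way the link is running below its maximum."""
    problems = []
    if link.active_width < link.max_width:
        problems.append(
            f"Linkwidth not at maximum: {link.active_width}x of {link.max_width}x"
        )
    if link.active_speed < link.max_speed:
        problems.append(
            f"Linkspeed not at maximum: {link.active_speed} of {link.max_speed} Gbps"
        )
    return problems


if __name__ == "__main__":
    # A 4x-capable port that trained at 1x and at the lowest speed.
    for message in link_problems(PortLink(1, 4, 2.5, 10.0)):
        print(message)
```

An empty list means the link is at full width and speed; anything else is a candidate for a cable or card reseat.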
PortRcvErrors
These errors can be due to local physical errors, local buffer overruns, or receipt of a malformed packet.
A malformed packet indicates a problem somewhere else on the fabric: some other device is putting bad packets on the wire.
PortRcvRemotePhysicalErrors
Similar to PortRcvErrors, except that the received packet has the End Bad Packet (EBP) flag set. Usually a problem between the physical and logical layers.
PortRcvSwitchRelayErrors
This field counts the number of packets that could not be forwarded by the switch.
The reasons for this include:
1. VL mapping errors. (LANL has not implemented VLs yet.)
2. Looping: the input port and output port are the same.
3. DLID errors: the DLID is a multicast DLID (0xC000 to 0xFFFE) not configured for this CA, the DLID is outside the LFTS range or greater than LinearFDBTop, or the port associated with this DLID in the LFTS file does not exist.
These errors are usually due to the poor implementation of multicast on IB and can therefore be ignored.
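The looping and DLID conditions listed above can be sketched as a small classifier. This is an illustration under stated assumptions rather than how a switch actually implements relay: the forwarding table is modelled as a plain dictionary from DLID to output port, the set of configured multicast DLIDs is passed in by the caller, VL mapping errors are left out, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional

MCAST_DLID_MIN = 0xC000  # multicast DLID range quoted in the text above
MCAST_DLID_MAX = 0xFFFE


def relay_error_reason(
    dlid: int,
    in_port: int,
    lft: Dict[int, int],
    linear_fdb_top: int,
    mcast_dlids: FrozenSet[int] = frozenset(),
) -> Optional[str]:
    """Return why a packet could not be relayed, or None if it can be forwarded.

    lft is a hypothetical stand-in for the linear forwarding table:
    a mapping from unicast DLID to output port number.
    """
    if MCAST_DLID_MIN <= dlid <= MCAST_DLID_MAX:
        if dlid not in mcast_dlids:
            return "multicast DLID not configured"
        return None  # multicast forwarding itself is not modelled here
    if dlid > linear_fdb_top:
        return "DLID greater than LinearFDBTop"
    out_port = lft.get(dlid)
    if out_port is None:
        return "no port associated with DLID"
    if out_port == in_port:
        return "looping: input port and output port are the same"
    return None


if __name__ == "__main__":
    table = {1: 3, 2: 5}
    print(relay_error_reason(1, 3, table, linear_fdb_top=2))
    print(relay_error_reason(0xC001, 1, table, linear_fdb_top=2))
```

Each non-None result corresponds to a case in which the counter would be incremented.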
Port[Rcv|Xmit]ConstraintErrors
This is the number of packets [ received and discarded on | not transmitted by ] a port in the fabric.
There are two general reasons for this:
1. The filter for raw packets [ inbound | outbound ] is turned on and these are raw packets.
2. The partition key or IP version check has failed.
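As an illustration of the partition key check, the sketch below compares two 16-bit P_Keys. It assumes the usual InfiniBand layout, in which the top bit marks full membership and the low 15 bits identify the partition, and treats two keys as matching when the low 15 bits agree and at least one side is a full member. The function name is hypothetical, and the sketch does not cover the IP version check or raw packet filtering.

```python
FULL_MEMBER_BIT = 0x8000   # top bit of the 16-bit P_Key: full membership
PARTITION_MASK = 0x7FFF    # low 15 bits: partition number


def pkeys_match(pkey_a: int, pkey_b: int) -> bool:
    """Return True if two P_Keys allow communication under the assumed rule."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    one_full_member = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and one_full_member


if __name__ == "__main__":
    print(pkeys_match(0xFFFF, 0x7FFF))  # full member talking to limited member
    print(pkeys_match(0x7FFF, 0x7FFF))  # two limited members
```

A packet whose P_Key fails this kind of comparison at a port that enforces partitions is the sort of packet that would be counted as a constraint error.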
PortXmitWait
This field counts the number of ticks during which the port had data to transmit but could not send it, typically because it lacked flow-control credits.
It is almost always non-zero.
Really large numbers indicate congestion. If the congestion gets really bad, you will see XmitDiscards.
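Because this counter is almost always non-zero, one way to make "really large" concrete is to look at how fast it grows between two readings rather than at its absolute value. The sketch below does that; the function names are hypothetical, and the threshold is a placeholder that would have to be chosen per fabric, not a recommended value.

```python
from typing import Optional


def counter_rate(prev: int, curr: int, interval_s: float) -> Optional[float]:
    """Return counter growth per second, or None if the counter went backwards."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr < prev:
        # The counter was cleared or wrapped between readings; no rate available.
        return None
    return (curr - prev) / interval_s


def looks_congested(prev: int, curr: int, interval_s: float,
                    threshold_per_s: float) -> bool:
    """Flag a port whose counter grows faster than a site-chosen threshold."""
    rate = counter_rate(prev, curr, interval_s)
    return rate is not None and rate > threshold_per_s


if __name__ == "__main__":
    # Two readings taken 60 seconds apart; the threshold of 50 is arbitrary.
    print(counter_rate(100, 400, 60))
    print(looks_congested(0, 6000, 60, threshold_per_s=50))
```

The same delta approach applies to any of the counters in this section that accumulate between resets.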
RcvRemotePhys(ical)Errors
This field counts "Total number of packets marked with the EBP delimiter received on the port."
The idea is that an "End Bad Packet" (EBP) delimiter can be used instead of EGP (End Good Packet) whenever a port knows there is something wrong with the packet. So, if a packet is passing through the fabric and some port notices a problem (e.g. a bad CRC), that port ends the packet with EBP instead of EGP. If the switch is forwarding in store-and-forward mode, it has the option of simply dropping the packet rather than wasting bandwidth on a packet marked EBP. The CA that reports this error is NOT where the corruption occurred; the corruption occurred elsewhere in the fabric.
SymbolErrors
The symbols within a packet are interpreted on the HCA/CA. If the translation or interpretation fails, the result is a minor event called a symbol error. 99% of all SymbolErrors are hardware related. If the counts are small (small being a relative term that is open to interpretation), they can be ignored. If the counts are large, or the same CA reports this error regularly, the port should be investigated. On a node, reseat the HCA and/or the cable; if the reseat does not help, replace the component. On a switch, reseat or replace the cable.
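The guidance above can be written down as a small decision function. This is only a sketch of the reasoning in the paragraph: the function name and parameters are hypothetical, and `small_count` is left to the caller because the text deliberately does not define what "small" means.

```python
def symbol_error_action(count: int, recurring: bool, on_switch: bool,
                        already_reseated: bool, small_count: int) -> str:
    """Suggest a next step for a port reporting SymbolErrors.

    small_count is a site-chosen cutoff below which errors are ignored.
    """
    if count <= small_count and not recurring:
        return "ignore"
    if on_switch:
        return "replace cable" if already_reseated else "reseat cable"
    if already_reseated:
        return "replace HCA or cable"
    return "reseat HCA and/or cable"


if __name__ == "__main__":
    # A node port that keeps reporting errors after a reseat.
    print(symbol_error_action(5000, recurring=True, on_switch=False,
                              already_reseated=True, small_count=100))
```

Keeping the cutoff as a parameter makes it explicit that the threshold is a local judgment call rather than a property of the counter.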
VL15Drop
VL15 is the virtual lane used for management packets. These packets are the first to be dropped when resources on the port are limited, usually because there is not enough space in the buffers. In many instances these errors can be ignored. There have been instances, however, when these drops were closely correlated with user problems, both in time and in location on the fabric. If management packets are being dropped, the buffers are being kept very busy with other data, so this counter can also indicate congestion.
XmtDiscards
This counter tracks packets that were discarded instead of transmitted. It usually indicates congestion in the fabric: the CA the packet was supposed to be sent to cannot accept it. After a number of retries, or when too many incoming packets are queued, the packet to be transmitted is dropped. If the fabric is routed well, without deadlocks or credit loops, these discards should be transient.