Sunday, July 17, 2011

Industrial Ethernet Reliability and Performance: Cisco’s “Errdisable” Functionality

Do you use Cisco Catalyst switches (or Rockwell Automation’s Stratix series of managed switches) on your network?  Have you ever had a port stop working, never to start again?  If so, there is probably nothing at all wrong with your switch.

Before I became acquainted with Cisco IOS (Internetwork Operating System), I made the same mistake many people do: if a port stops working and I can get my device working again just by moving the connection to another port, the port must be bad. In my experience with Cisco switches, this is rarely the case. More often, the cause is a feature called ErrDisable, which is enabled by default on many Cisco devices. This feature is designed to detect network problems and stop them before the rest of the network is affected. The default behavior is to disable the port in question until someone intervenes. To re-enable the port, an administrator has to issue the shutdown command followed by the no shutdown command on the affected interface. There is also a feature that lets the user set a recovery interval for the errdisabled state. If recovery is configured, the switch re-enables the disabled port automatically once the interval expires; if the error condition still exists, the port is simply disabled again.
The guidance provided by Cisco and Rockwell is to set the recovery interval using the errdisable recovery interval seconds command. Combined with the errdisable recovery cause errortype command, recovery can be configured quite granularly, based on the type of error encountered. Playing devil’s advocate, I could argue that configuring errdisable recovery may cause further problems unless the switch logs are monitored on a regular basis. Suppose you have an intermittent hardware problem, such as a sloppy cable termination, that is causing a link flap (a condition in which the physical link is broken more than 5 times in 10 seconds, easily caused by poor terminations and vibration in an industrial environment). In this case, if errdisable recovery has been configured, the problem may not be discovered until there is a catastrophic failure, resulting in manufacturing downtime instead of scheduled maintenance. My point is that just because recovery keeps data flowing in the short term, you cannot assume that no problem exists.
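
To make the devil's advocate argument concrete, here is a rough Python sketch of a port that is errdisabled after more than 5 link-down events in 10 seconds, with an optional recovery timer. This is a toy model for illustration only, not how IOS is actually implemented: the Port class, its method names, and the 300-second interval used in the example are all assumptions of mine.

```python
from collections import deque


class Port:
    """Toy model of a switch port with link-flap errdisable (hypothetical)."""

    def __init__(self, max_flaps=5, window=10.0, recovery_interval=None):
        self.max_flaps = max_flaps              # flaps tolerated inside the window
        self.window = window                    # sliding window length, in seconds
        self.recovery_interval = recovery_interval  # None = manual recovery only
        self._flaps = deque()                   # timestamps of recent link-down events
        self.disabled_at = None                 # time the port was errdisabled
        self.errdisable_count = 0               # stands in for entries in the switch log

    @property
    def err_disabled(self):
        return self.disabled_at is not None

    def tick(self, now):
        """Re-enable the port if a recovery timer is configured and has expired."""
        if (self.err_disabled and self.recovery_interval is not None
                and now - self.disabled_at >= self.recovery_interval):
            self.disabled_at = None
            self._flaps.clear()

    def link_down(self, now):
        """Record one link-down event at time `now` (seconds)."""
        self.tick(now)
        if self.err_disabled:
            return  # a disabled port sees no further link events
        self._flaps.append(now)
        # Forget events that have fallen out of the sliding window.
        while now - self._flaps[0] > self.window:
            self._flaps.popleft()
        if len(self._flaps) > self.max_flaps:
            self.disabled_at = now
            self.errdisable_count += 1


if __name__ == "__main__":
    # A flaky cable: bursts of 6 flaps, repeated every 305 seconds.
    port = Port(recovery_interval=300.0)
    for burst_start in (0, 305, 610):
        for i in range(6):
            port.link_down(float(burst_start + i))
    print("errdisable events:", port.errdisable_count)
```

With the recovery timer set, the port in this sketch keeps coming back after every burst, so the only evidence of the bad cable is the growing event count. Without the timer, the first burst takes the port down until someone looks at it, which is inconvenient but hard to miss.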

Monitoring is essential to the reliability of technology systems, but that is a whole other topic. Here is a document that outlines some of Cisco and Rockwell Automation’s guidelines for plantwide Ethernet:
Detailed information about the errdisabled state per Cisco’s documentation library:

[ Original Post by Jed Leviner]