From VCS engine_A.log:
2012/04/27 01:19:33 VCS INFO V-16-10001-6557 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Up to Down
2012/04/27 01:19:42 VCS INFO V-16-10001-6556 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Down to Up
2012/04/27 18:23:26 VCS INFO V-16-10001-6557 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Up to Down
2012/04/27 18:23:36 VCS INFO V-16-10001-6556 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Down to Up
From MultiNICB agent (debug) log:
2012/05/29 03:01:41 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb0 =100 MultiNICB.C:checkStatus[970]
2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 =111 MultiNICB.C:checkStatus[970]
2012/05/29 03:01:42 VCS INFO V-16-10001-6557 MultiNICB:Network_MultiNICB:monitor:Device: igb1 went from Up to Down
The VCS MultiNICB agent uses haping (/opt/VRTSvcs/bin/MultiNICB/haping) to check link health. haping sends ICMP request packets to the configured NetworkHosts and waits for a reply for the NetworkTimeout interval (default 100 ms). If a reply is received within this period, haping returns 100, meaning the link is up; otherwise it returns an error code.
haping returns different error codes depending on the type of error.
In this case haping returns 111, which indicates a timeout. This can happen for several reasons:
- The network host the agent is trying to reach takes longer than NetworkTimeout to reply, so haping times out.
- The reply is delayed by high network traffic.
- Network fluctuations cause request or reply packets to be dropped.
- The network host itself is down, and so on.
When haping reports a timeout for a specific link, the MultiNICB agent retries up to OfflineTestRepeatCount (default 3) times before reporting the link as down, i.e. haping is invoked 3 times for the same interface. Only if haping returns an error all 3 times does the agent report the link as down.
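The retry rule above can be sketched in shell. This is only an illustration of the rule as described, not the agent's actual code: link_state and its probe argument are hypothetical names, and the probe stands in for a single haping invocation that returns 0 when a reply arrived in time and non-zero otherwise.

```shell
#!/bin/sh
# Sketch of the retry rule: a link is reported Down only if every one
# of the allowed probes fails. Not the MultiNICB agent's real code.
# $1 = probe command or function (hypothetical stand-in for one haping call)
# $2 = interface to test, e.g. igb1
# $3 = number of attempts, playing the role of OfflineTestRepeatCount
link_state() {
    probe=$1
    device=$2
    repeat=${3:-3}
    attempt=1
    while [ "$attempt" -le "$repeat" ]; do
        # One successful probe is enough to treat the link as up.
        if "$probe" "$device"; then
            echo "Up"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    # Every attempt failed.
    echo "Down"
    return 1
}
```

With a probe that fails twice and then succeeds, link_state prints Up; with a probe that always fails, it prints Down after the third attempt.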
2012/05/29 03:01:41 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb0 = 100 MultiNICB.C:checkStatus[970]
2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 111 MultiNICB.C:checkStatus[970]
2012/05/29 03:01:42 VCS INFO V-16-10001-6557 MultiNICB:Network_MultiNICB:monitor:Device: igb1 went from Up to Down
2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:956 calling haping on igb1 MultiNICB.C:checkStatus[956]
2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 111 MultiNICB.C:checkStatus[970]
2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:956 calling haping on igb1 MultiNICB.C:checkStatus[956]
2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 100 MultiNICB.C:checkStatus[970]
This condition mostly occurs because of network fluctuations or a flood of network traffic.
This can be confirmed by running the haping command and collecting its output. The data needs to be collected while the issue is occurring, i.e. when haping times out (haping return value = 111):
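To see how often haping timed out on each interface, the debug log lines can be summarized. The following is a minimal sketch, assuming the "haping status for <device> = <code>" wording shown in the excerpts above; summarize_haping is a hypothetical helper name, and the log is read from standard input so that no log file path is assumed.

```shell
#!/bin/sh
# Count haping return codes per interface from MultiNICB debug log
# lines read on stdin. Output columns: device, return code, count.
summarize_haping() {
    # Pull out "<device> <code>" from each haping status line; the
    # pattern tolerates both "=100" and "= 100" spacing.
    sed -n 's/.*haping status for \([^ ]*\) *= *\([0-9][0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

A device that shows many 111 results next to its 100 results is timing out intermittently rather than staying down.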
# /opt/VRTSvcs/bin/MultiNICB/haping -v -g
For example:
# /opt/VRTSvcs/bin/MultiNICB/haping -v -g igb1
The output of ping to the default router or the configured NetworkHosts should also be checked.
# ping -s 10 10
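When reviewing the ping output, it can help to count how many replies took longer than NetworkTimeout. The filter below is a rough sketch: slow_replies is a hypothetical helper name, and it assumes that each reply line carries a field of the form time=<number> in milliseconds, which may not hold for every ping implementation or packet size.

```shell
#!/bin/sh
# Read ping output on stdin and report how many replies were slower
# than a threshold in milliseconds (default 100, matching the default
# NetworkTimeout). Lines without a "time=" field are ignored.
slow_replies() {
    limit_ms=${1:-100}
    awk -v limit="$limit_ms" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    total++
                    # Text after "time=" is converted to a number.
                    rtt = substr($i, 6) + 0
                    if (rtt > limit) slow++
                }
            }
        }
        END { printf "%d of %d replies slower than %s ms\n", slow, total, limit }
    '
}
```

Lost packets never produce a reply line, so they do not appear in this count; the packet loss figure in ping's own summary should be checked as well.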
One possible workaround is to increase the NetworkTimeout value from the default 100 ms to, say, 1000 ms as follows:
# haconf -makerw
# hares -modify Network_MultiNICB NetworkTimeout 1000
# haconf -dump -makero
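The same change can be wrapped in a small function that validates the value and reads it back with hares -value before saving. This is a sketch only: set_network_timeout and the DRY_RUN variable are hypothetical names introduced here, and setting DRY_RUN=echo prints the commands instead of running them, which is useful for reviewing the sequence first.

```shell
#!/bin/sh
# Set NetworkTimeout (in ms) on a MultiNICB resource, read the value
# back, then save the configuration and make it read-only again.
# $1 = resource name, e.g. Network_MultiNICB
# $2 = timeout in milliseconds
# Set DRY_RUN=echo to print the commands instead of executing them.
set_network_timeout() {
    res=$1
    ms=$2
    case $ms in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of milliseconds" >&2
            return 2
            ;;
    esac
    # Stop at the first command that fails.
    ${DRY_RUN:-} haconf -makerw &&
    ${DRY_RUN:-} hares -modify "$res" NetworkTimeout "$ms" &&
    ${DRY_RUN:-} hares -value "$res" NetworkTimeout &&
    ${DRY_RUN:-} haconf -dump -makero
}
```

If a command fails partway through, the configuration may be left writable, so the state should be checked before retrying.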
Applies To
Solaris 10
VCS 5.1SP1
MultiNICB resource configured in Base Mode (UseMpathd = 0)