VCS MultiNICB resource sees frequent link UP / DOWN messages

Description

Error Message

From VCS engine_A.log:

2012/04/27 01:19:33 VCS INFO V-16-10001-6557 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Up to Down

2012/04/27 01:19:42 VCS INFO V-16-10001-6556 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Down to Up



2012/04/27 18:23:26 VCS INFO V-16-10001-6557 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Up to Down

2012/04/27 18:23:36 VCS INFO V-16-10001-6556 (n933) MultiNICB:Network_MultiNICB:monitor:Device: igb0 went from Down to Up
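To gauge how often each device flaps, the Up to Down transitions in the engine log can be tallied per device. The following is a minimal sketch that assumes the message format shown above. The function name count_down_events is hypothetical, and the log path in the usage comment is an assumption about where engine_A.log is kept on the node.

```shell
# Hypothetical helper: count "Up to Down" transitions per device.
# Reads engine log lines on standard input; assumes the message format above.
count_down_events() {
  awk '/went from Up to Down/ {
    # The device name is the field that follows the one ending in "Device:".
    for (i = 1; i < NF; i++) {
      if ($i ~ /Device:$/) { n[$(i + 1)]++; break }
    }
  }
  END { for (d in n) print d, n[d] }' | sort
}

# Example (assumed log location; adjust as needed):
# count_down_events < /var/VRTSvcs/log/engine_A.log
```

Each output line is a device name followed by the number of Up to Down messages seen for it.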

From MultiNICB agent (debug) log:

2012/05/29 03:01:41 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb0 =100 MultiNICB.C:checkStatus[970]

2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 =111 MultiNICB.C:checkStatus[970]

2012/05/29 03:01:42 VCS INFO V-16-10001-6557 MultiNICB:Network_MultiNICB:monitor:Device: igb1 went from Up to Down
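When the debug log is long, it can help to summarize the haping status codes per interface rather than read them line by line. The sketch below assumes the "haping status for" line format shown above (with or without a space after the equals sign). The function name tally_haping_status is hypothetical, and the log is read from standard input.

```shell
# Hypothetical helper: tally haping status codes per interface.
# Reads MultiNICB agent debug log lines on standard input.
tally_haping_status() {
  awk '/haping status for/ {
    line = $0
    # Drop everything up to and including "haping status for ".
    sub(/.*haping status for /, "", line)
    # line now looks like "igb1 =111 ..." or "igb1 = 111 ...".
    split(line, part, "=")
    iface = part[1]; gsub(/ /, "", iface)
    code = part[2]; sub(/^ */, "", code); sub(/ .*/, "", code)
    count[iface " " code]++
  }
  END { for (k in count) print k, count[k] }' | sort
}
```

Each output line has the form "interface code count", so a high count of code 111 for one interface points at that link.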

Cause

The VCS MultiNICB agent uses haping (/opt/VRTSvcs/bin/MultiNICB/haping) to check link health. haping sends ICMP request packets to the configured NetworkHosts and waits for a reply for the NetworkTimeout interval (default 100 msec). If a reply is received within this period, haping returns 100 (link up); otherwise it returns an error code.

haping returns different error codes depending on the type of error.

In this case haping returns 111, which indicates a timeout. This can happen for several reasons:
- The network host the agent is trying to reach takes too long to reply, and haping times out.
- The reply is delayed because of high network traffic.
- Network fluctuations cause request or reply packets to be dropped.
- The network host itself is down, and so on.

When haping reports a timeout for a specific link, the MultiNICB agent retries OfflineTestRepeatCount (default 3) times before reporting the link as down, i.e. haping is invoked 3 times for the same interface. Only if haping returns an error all 3 times does the agent report the link as down.

2012/05/29 03:01:41 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb0 = 100 MultiNICB.C:checkStatus[970]

2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 111 MultiNICB.C:checkStatus[970]

2012/05/29 03:01:42 VCS INFO V-16-10001-6557 MultiNICB:Network_MultiNICB:monitor:Device: igb1 went from Up to Down

2012/05/29 03:01:42 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:956 calling haping on igb1 MultiNICB.C:checkStatus[956]

2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 111 MultiNICB.C:checkStatus[970]

2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:956 calling haping on igb1 MultiNICB.C:checkStatus[956]

2012/05/29 03:01:43 VCS DBG_4 V-16-50-0 MultiNICB:Network_MultiNICB:monitor: In checkStatus:970 haping status for igb1 = 100 MultiNICB.C:checkStatus[970]
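The retry rule described in this section can be modelled in a few lines. This is only an illustrative sketch of the rule as stated here, not the agent's actual code. The function name link_state is hypothetical, and the haping return codes are passed in as arguments rather than obtained by running haping.

```shell
# Illustrative model of the retry rule described above, not the agent's code.
# Usage: link_state REPEAT_COUNT CODE [CODE...]
# CODE values are haping return codes in the order they were observed.
link_state() {
  repeat=$1; shift
  tries=0
  for code in "$@"; do
    # Stop once the allowed number of attempts has been used up.
    if [ "$tries" -ge "$repeat" ]; then
      break
    fi
    tries=$((tries + 1))
    # 100 means haping received a reply, so the link is considered up.
    if [ "$code" -eq 100 ]; then
      echo up
      return 0
    fi
  done
  echo down
}

# Examples:
# link_state 3 111 111 100   (one reply within three attempts)
# link_state 3 111 111 111   (three consecutive errors)
```

With a repeat count of 3, a single successful reply among the three attempts is enough for the model to report the link as up; only three consecutive errors yield down.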

Resolution

This condition mostly occurs because of network fluctuations or a flood of network traffic.

This can be confirmed by running the haping command and collecting the following output. The data must be collected while the issue is occurring, i.e. when haping times out (haping return value = 111).

# /opt/VRTSvcs/bin/MultiNICB/haping -v -g

For example:

# /opt/VRTSvcs/bin/MultiNICB/haping -v -g igb1
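Because the timeouts are intermittent, a single manual run may not catch one. A simple loop that runs the command repeatedly and records its exit status with a timestamp can help. The sketch below is generic: the function name sample_exit_status and the SAMPLE_INTERVAL variable are hypothetical, and it assumes that haping's return value (100, 111, and so on) is visible to the shell as the command's exit status.

```shell
# Hypothetical helper: run a command repeatedly and record its exit status.
# Usage: sample_exit_status COUNT CMD [ARGS...]
# SAMPLE_INTERVAL (seconds between runs) defaults to 1.
sample_exit_status() {
  count=$1; shift
  i=0
  while [ "$i" -lt "$count" ]; do
    out=$("$@" 2>&1)
    rc=$?
    # Timestamp and exit status first, then whatever the command printed.
    printf '%s rc=%s\n' "$(date '+%Y/%m/%d %H:%M:%S')" "$rc"
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
    fi
    i=$((i + 1))
    sleep "${SAMPLE_INTERVAL:-1}"
  done
}

# Example (assumes the haping invocation shown above; output file is arbitrary):
# sample_exit_status 60 /opt/VRTSvcs/bin/MultiNICB/haping -v -g igb1 >> /tmp/haping_igb1.out
```

Lines containing rc=111 in the collected output then mark the runs that timed out.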

The output of ping to the default router and to the configured NetworkHosts should also be checked.

# ping -s 10 10

One possible workaround is to increase the NetworkTimeout value from the default 100 ms to, say, 1000 ms as follows:

# haconf -makerw

# hares -modify Network_MultiNICB NetworkTimeout 1000

# haconf -dump -makero
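After the configuration is saved, the new value can be read back to confirm that the change took effect. The wrapper below is a small sketch around hares -value; the function name show_network_timeout is hypothetical, and Network_MultiNICB is the resource name used in this article.

```shell
# Hypothetical wrapper: print the current NetworkTimeout of a MultiNICB resource.
# Usage: show_network_timeout RESOURCE_NAME
show_network_timeout() {
  hares -value "$1" NetworkTimeout
}

# Example:
# show_network_timeout Network_MultiNICB
```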

Applies To

Solaris 10

VCS 5.1SP1

MultiNICB resource configured in Base Mode (UseMpathd = 0)

Issue/Introduction

Seeing frequent link UP / DOWN messages for VCS MultiNICB resource.

Additional Information

ETrack: 2804288
