What to do when an unexpected failover occurs with the error message "ERROR V-16-2-13067"

Article ID: 100002063

Resolution

[ THE PATTERN OF ERROR MESSAGES ]
1) The IP addresses on e1000g0 failed over to e1000g2 for an unknown reason.
May 24 10:34:50 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 24 10:34:50 sbos509 in.mpathd[513]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g0 to NIC e1000g2
 
2) However, the IPMultiNICB resources reported "V-16-2-13075" and "V-16-2-13067" because they found no recognizable IP addresses.
2010/05/24 10:34:50 VCS INFO V-16-2-13075 (sbos509) Resource(bpbapdbos3_IPMultiNICB_10_86_77_201) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:34:50 VCS INFO V-16-2-13075 (sbos509) Resource(bpbapdbos1_IPMultiNICB_10_86_75_182) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:34:51 VCS INFO V-16-2-13075 (sbos509) Resource(plpapdbos1_IPMultiNICB_10_86_75_183) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:34:51 VCS INFO V-16-2-13075 (sbos509) Resource(plpapdbos2_IPMultiNICB_10_86_77_194) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:34:51 VCS INFO V-16-2-13075 (sbos509) Resource(plpapdbos3_IPMultiNICB_10_86_77_202) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
..
2010/05/24 10:35:20 VCS ERROR V-16-2-13067 (sbos509) Agent is calling clean for resource (bpbapdbos3_IPMultiNICB_10_86_77_201) because the resource became OFFLINE unexpectedly, on its own.
2010/05/24 10:35:20 VCS ERROR V-16-2-13067 (sbos509) Agent is calling clean for resource (bpbapdbos1_IPMultiNICB_10_86_75_182) because the resource became OFFLINE unexpectedly, on its own.
2010/05/24 10:35:20 VCS INFO V-16-2-13068 (sbos509) Resource(bpbapdbos1_IPMultiNICB_10_86_75_182) - clean completed successfully.
2010/05/24 10:35:20 VCS INFO V-16-2-13068 (sbos509) Resource(bpbapdbos3_IPMultiNICB_10_86_77_201) - clean completed successfully.
2010/05/24 10:35:20 VCS INFO V-16-1-10307 Resource bpbapdbos3_IPMultiNICB_10_86_77_201 (Owner: UTS, Group: bpb) is offline on sbos509 (Not initiated by VCS)
2010/05/24 10:35:20 VCS INFO V-16-1-10307 Resource bpbapdbos1_IPMultiNICB_10_86_75_182 (Owner: UTS, Group: bpb) is offline on sbos509 (Not initiated by VCS)
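
When triaging an engine log like this one, it can help to measure how long each resource stayed within its tolerance before clean was called. The following is a minimal sketch, not a Veritas tool: the function name offline_to_clean_delay and the regular expression are illustrative assumptions, and the sample lines are copied from the excerpt above with the date and time separated by a space.

```python
import re
from datetime import datetime

# Two resources from the engine log excerpt (date and time separated by a space).
SAMPLE = """\
2010/05/24 10:34:50 VCS INFO V-16-2-13075 (sbos509) Resource(bpbapdbos3_IPMultiNICB_10_86_77_201) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:34:50 VCS INFO V-16-2-13075 (sbos509) Resource(bpbapdbos1_IPMultiNICB_10_86_75_182) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2010/05/24 10:35:20 VCS ERROR V-16-2-13067 (sbos509) Agent is calling clean for resource (bpbapdbos3_IPMultiNICB_10_86_77_201) because the resource became OFFLINE unexpectedly, on its own.
2010/05/24 10:35:20 VCS ERROR V-16-2-13067 (sbos509) Agent is calling clean for resource (bpbapdbos1_IPMultiNICB_10_86_75_182) because the resource became OFFLINE unexpectedly, on its own.
"""

# Timestamp, severity, message ID, then the resource name in parentheses.
LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS (?P<sev>\w+) "
    r"(?P<id>V-\d+-\d+-\d+) .*?[Rr]esource\s*\((?P<res>[^)]+)\)"
)

def offline_to_clean_delay(text):
    """Map each resource to the seconds between its first V-16-2-13075
    (unexpected OFFLINE) and its first V-16-2-13067 (clean called)."""
    first_offline = {}
    delays = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # unrelated line
        ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S")
        res = m["res"]
        if m["id"] == "V-16-2-13075":
            first_offline.setdefault(res, ts)
        elif m["id"] == "V-16-2-13067" and res in first_offline:
            delays.setdefault(res, int((ts - first_offline[res]).total_seconds()))
    return delays

delays = offline_to_clean_delay(SAMPLE)
for res, secs in delays.items():
    print(f"{res}: clean called {secs}s after the first unexpected OFFLINE")
```

For the two resources in the sample, the delay is 30 seconds, which equals one IPMultiNICB MonitorInterval.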
 
Note that these systems are not configured in a zone.
 
 
[ THE ENTRY OF MAIN.CF ]
MultiNICB MultiNICB_MNICB_10_86_72 (
    Critical = 0
    ResourceOwner = UTS
    UseMpathd = 1                               <----- So, Solaris mpathd is used to monitor
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    Device @sbos509 = { e1000g0 = 0, e1000g2 = 1 }
    )
 
[ COMMENT ] To show why the IPMultiNICB resources failed, first determine why the MultiNICB resource failed by reviewing the in.mpathd messages in the syslog.
1) Check /var/adm/messages.
2) There may have been many failures of the primary interface (e1000g0) with failover to e1000g2, followed by repairs and failback to the primary interface, as shown below.
 

 

May 21 23:05:53 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:05:53 sbos509 in.mpathd[513]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g0 to NIC e1000g2
May 21 23:06:13 sbos509 SYSMSG: [ID 483912 local0.notice] Uts0052N: FSAlert(sbos509) All file systems are at acceptable levels.
May 21 23:06:22 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:06:23 sbos509 in.mpathd[513]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g0
May 21 23:08:19 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:08:19 sbos509 in.mpathd[513]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g0 to NIC e1000g2
May 21 23:08:48 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:08:48 sbos509 in.mpathd[513]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g0
May 21 23:11:07 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:11:07 sbos509 in.mpathd[513]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g0 to NIC e1000g2
May 21 23:11:36 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:11:36 sbos509 in.mpathd[513]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g0
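
To judge how unstable an interface is, the failure and repair messages from in.mpathd can be paired up and timed. The sketch below is one possible way to do that and is not part of VCS or Solaris. It assumes the usual syslog timestamp layout (for example "May 21 23:05:53"), and because syslog lines carry no year, it takes the year as a parameter (2010 here, taken from the VCS engine log dates). The function name flap_durations is hypothetical.

```python
import re
from datetime import datetime

# Failure/repair lines in the same shape as the syslog excerpt above.
SAMPLE = """\
May 21 23:05:53 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:06:22 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:08:19 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:08:48 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:11:07 sbos509 in.mpathd[513]: [ID 594170 daemon.error] NIC failure detected on e1000g0 of group MultiNICB_MNICB_10_86_72
May 21 23:11:36 sbos509 in.mpathd[513]: [ID 299542 daemon.error] NIC repair detected on e1000g0 of group MultiNICB_MNICB_10_86_72
"""

EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}) \S+ in\.mpathd\[\d+\]: .*"
    r"NIC (?P<kind>failure|repair) detected on (?P<nic>\S+) of group (?P<group>\S+)"
)

def flap_durations(text, year=2010):
    """Return (nic, failure_time, seconds_until_repair) for each failure/repair pair."""
    pending = {}  # nic -> time of the failure still waiting for a repair
    flaps = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if not m:
            continue  # not an in.mpathd failure/repair message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        nic = m["nic"]
        if m["kind"] == "failure":
            pending[nic] = ts
        elif nic in pending:
            start = pending.pop(nic)
            flaps.append((nic, start, int((ts - start).total_seconds())))
    return flaps

flaps = flap_durations(SAMPLE)
for nic, start, secs in flaps:
    print(f"{nic} failed at {start:%b %d %H:%M:%S}, repaired after {secs}s")
```

On the sample, each of the three outages on e1000g0 lasts 29 seconds. In practice the same function could be fed the contents of /var/adm/messages.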
 
 
[ EXPLANATION OF PROBLEM ]
1. in.mpathd does the monitoring for the MultiNICB resource.
2. Every 1 minute (when VCS monitors the MultiNICB resource), VCS gets the status from in.mpathd and stores it under /var/VRTSvcs/lock/MultiNICB/.
3. The IPMultiNICB resources then use this status to determine which one of the interfaces (e1000g0 or e1000g2) is the "primary or live" interface.
4. Once this is determined, VCS tests the IP address on that interface.
5. The main problem is that the MultiNICB interfaces (e1000g0 and e1000g2) are not stable, as shown above. There are many failovers and failbacks.
6. This means VCS can read the status from in.mpathd while a failover (or failback) is in progress, and so get a status saying that the "primary or live" interface is dead. The interface is not really dead; it is in the process of failing over. Again, the underlying reason is that the interfaces are not stable to begin with.
 
 
[ RESOLUTION ]
1. Make sure that the hardware is OK.
2. Make sure the network is actually healthy (a healthy network would not fail, and this one fails regularly).
3. The only way for the customer to work around this unstable network is to give the MultiNICB and IPMultiNICB resources more tolerance, that is, to increase the tolerance limits for both resource types.
 
[ CURRENT PARAMETERS ]
IPMultiNICB   FaultOnMonitorTimeouts   4
IPMultiNICB   MonitorInterval          30
IPMultiNICB   MonitorTimeout           60
IPMultiNICB   OfflineMonitorInterval   300
IPMultiNICB   RestartLimit             0      << This can be tuned up to 1
IPMultiNICB   ToleranceLimit           1
 
MultiNICB     FaultOnMonitorTimeouts   4
MultiNICB     MonitorInterval          10
MultiNICB     MonitorTimeout           60
MultiNICB     OfflineMonitorInterval   60
MultiNICB     RestartLimit             0
MultiNICB     ToleranceLimit           0      << This can be tuned up to 1
 
[ LATEST PARAMETERS ]
IPMultiNICB   FaultOnMonitorTimeouts   4
IPMultiNICB   MonitorInterval          30
IPMultiNICB   MonitorTimeout           60
IPMultiNICB   OfflineMonitorInterval   300
IPMultiNICB   RestartLimit             1
IPMultiNICB   ToleranceLimit           4
 
MultiNICB     FaultOnMonitorTimeouts   4
MultiNICB     MonitorInterval          10
MultiNICB     MonitorTimeout           60
MultiNICB     OfflineMonitorInterval   60
MultiNICB     RestartLimit             1
MultiNICB     ToleranceLimit           4
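
The effect of the new tolerance limits can be estimated with simple arithmetic. The sketch below assumes a simplified model: the agent declares the resource faulted on the (ToleranceLimit + 1)-th consecutive OFFLINE report, and reports arrive once per MonitorInterval. This matches the engine log in this article, where clean was called 30 seconds after the first unexpected OFFLINE report (ToleranceLimit 1, MonitorInterval 30). RestartLimit is not modeled, and the function name seconds_until_fault is hypothetical.

```python
def seconds_until_fault(tolerance_limit, monitor_interval):
    """Approximate time from the first unexpected OFFLINE report to the fault.

    Simplified model (an assumption, not the agent framework's exact logic):
    the fault is declared on the (tolerance_limit + 1)-th consecutive OFFLINE
    report, and reports arrive every monitor_interval seconds.
    """
    return tolerance_limit * monitor_interval

# type -> (MonitorInterval, current ToleranceLimit, latest ToleranceLimit),
# values taken from the parameter listings in this article.
SETTINGS = {
    "IPMultiNICB": (30, 1, 4),
    "MultiNICB": (10, 0, 4),
}

for rtype, (interval, current, latest) in SETTINGS.items():
    before = seconds_until_fault(current, interval)
    after = seconds_until_fault(latest, interval)
    print(f"{rtype}: {before}s -> {after}s")

# Prints:
# IPMultiNICB: 30s -> 120s
# MultiNICB: 0s -> 40s
```

Under this model, the latest settings tolerate outages longer than the roughly 29 seconds between each NIC failure and repair in the syslog excerpt, whereas the current MultiNICB setting (ToleranceLimit 0) tolerates none. The trade-off is that a genuine failure also takes longer to be detected.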