Running bpps -a shows the processes are down.
Running the cluster monitor script returns status 100, indicating that NetBackup is down. However, the NetBackup VCS agent still reports to VCS that NetBackup is running. The NetBackup_A agent log does capture the problem correctly. In this case, the nbstserv process was down:
VCS INFO V-16-2-13716 Thread(307) Resource(server_NB): Output of the completed operation (monitor)
==============================================
Some Processes are DOWN while others are UP
Following Process are found DOWN: nbstserv
The following message points to the root cause of the problem:
VCS INFO V-16-2-13845 Thread(307) Resource(server_NB): Output of the timed out operation (clean)
==============================================
awk: record ` root 24976 1...' has too many fields record number 50
When a problem is detected in the cluster, the engine_A log shows that VCS tried to offline NetBackup, but the clean operation kept timing out:
2020/03/19 07:40:30 VCS ERROR V-16-2-13067 (sbpnd01ba144) Agent is calling clean for resource(server_NBU) because the resource became OFFLINE unexpectedly, on its own.
2020/03/19 07:41:32 VCS ERROR V-16-2-13006 (sbpnd01ba144) Resource(server_NBU): clean procedure did not complete within the expected time.
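To confirm that the clean operation is failing repeatedly rather than once, the two message IDs shown above can be counted per resource. The following is a minimal sketch, not a supported tool: the function name count_clean_events is hypothetical, and it assumes the log lines follow the same layout as the excerpts above.

```python
import re
from collections import Counter

# Message IDs taken from the engine_A log excerpts above.
CLEAN_CALLED = "V-16-2-13067"     # agent is calling clean for the resource
CLEAN_TIMED_OUT = "V-16-2-13006"  # clean did not complete within the expected time

# Matches the resource name in both "resource(name)" and "Resource(name)".
RESOURCE_RE = re.compile(r"[Rr]esource\(([^)]+)\)")


def count_clean_events(lines):
    """Return two Counters: clean calls and clean timeouts, keyed by resource."""
    called, timed_out = Counter(), Counter()
    for line in lines:
        match = RESOURCE_RE.search(line)
        if not match:
            continue  # not a resource-level message
        resource = match.group(1)
        if CLEAN_CALLED in line:
            called[resource] += 1
        elif CLEAN_TIMED_OUT in line:
            timed_out[resource] += 1
    return called, timed_out
```

Passing an open engine_A log file to count_clean_events returns both counters. A resource whose timeout count keeps pace with its clean-call count matches the behavior described here.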
The "too many fields" message comes from the version of awk on Solaris. Various NetBackup commands use awk to parse the process table output, and awk can fail if any line of that output exceeds its limits on the number of words or the number of characters per line.
The command lines for some NetBackup processes (such as nbwmc, mqbroker, or even bpbrm) may exceed the limits that awk can handle.
When that happens, the start or stop script cannot accurately determine whether processes are running, and VCS cannot determine the correct status/state of the NetBackup processes. This issue can affect the NetBackup commands netbackup, bp.kill_all, and bp.start_all.
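To see which processes are likely to trigger the awk failure, the process table output can be checked for unusually long lines. The following is a minimal sketch under stated assumptions: the function name find_risky_lines is hypothetical, and max_fields and max_chars are caller-supplied values, because the exact limits of the Solaris awk are not stated here.

```python
def find_risky_lines(ps_output, max_fields, max_chars):
    """Return (line_number, field_count, char_count) for lines over either limit.

    max_fields and max_chars are assumptions supplied by the caller;
    they are not documented awk limits.
    """
    risky = []
    for number, line in enumerate(ps_output.splitlines(), start=1):
        # Count whitespace-separated words, which is how awk splits
        # a record into fields by default.
        fields = len(line.split())
        chars = len(line)
        if fields > max_fields or chars > max_chars:
            risky.append((number, fields, chars))
    return risky
```

Feeding this function the saved output of a full process listing gives the line numbers to inspect. The record number in the awk error (50 in the example above) should correspond to one of the reported lines if the chosen thresholds are close to the real limits.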
The inability to shut down NetBackup cleanly has a negative impact on VCS cluster failover on Solaris.
A fix for this issue is available by applying an emergency engineering binary (EEB), from E-Track 4000186, to the master server. A formal resolution to this issue will be part of a future NetBackup release.
VCS reports NetBackup as running in the output of hastatus -sum. However, bpps shows that the NetBackup services are all down.
This issue is known to affect NetBackup 8.1 through 8.2 master servers on Solaris, both clustered and standalone. A related failure may occur on Solaris media servers.