The NetBackup VCS cluster resource is showing ONLINE when NetBackup is down.


Article ID: 100047929


Description

Error Message

Running bpps -a shows the processes are down.

Running the cluster monitor script returns status 100, indicating that NetBackup is down, but the NetBackup VCS agent still reports to VCS that NetBackup is running. The NetBackup_A agent log does capture the problem correctly. In this case, the nbstserv process was down:

VCS INFO V-16-2-13716 Thread(307) Resource(server_NB): Output of the completed operation (monitor)
==============================================
Some Processes are DOWN while others are UP
Following Process are found DOWN: nbstserv
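
To pull the names of the down processes out of a longer agent log, a small filter such as the following may help. It is only a sketch: the function name list_down_processes is hypothetical, and it assumes the monitor output always uses the exact "Following Process are found DOWN:" prefix shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: read agent log text on stdin and print only the
# process names that the monitor operation reported as DOWN.
# Assumes the line prefix matches the excerpt in this article exactly.
list_down_processes() {
    sed -n 's/^Following Process are found DOWN: *//p'
}
```

It reads from standard input, so the log file can be redirected into it, for example list_down_processes < NetBackup_A.log, where the log file name and location are assumptions to adjust for the local installation.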

The following message, from the output of the timed-out clean operation, points to the root cause of the problem:

VCS INFO V-16-2-13845 Thread(307) Resource(server_NB): Output of the timed out operation (clean)
==============================================
awk: record ` root 24976 1...' has too many fields record number 50

When a problem is detected in the cluster, the engine_A log shows that VCS tried to take NetBackup offline, but the clean procedure kept failing to complete in time:

2020/03/19 07:40:30 VCS ERROR V-16-2-13067 (sbpnd01ba144) Agent is calling clean for resource(server_NBU) because the resource became OFFLINE unexpectedly, on its own.
2020/03/19 07:41:32 VCS ERROR V-16-2-13006 (sbpnd01ba144) Resource(server_NBU): clean procedure did not complete within the expected time.

Cause

The "too many fields" message comes from the version of awk on Solaris. Various NetBackup commands use awk to parse the process table output, and awk can fail if any line of that output exceeds its limits on either the number of words or the number of characters per line.

The command lines for some NetBackup processes (such as nbwmc, mqbroker, or even bpbrm) may exceed the limits that awk can handle. 

When that happens, the start or stop script cannot accurately determine whether processes are running, and VCS cannot determine the correct state of the NetBackup processes. This issue can affect the NetBackup commands netbackup, bp.kill_all, and bp.start_all.
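
To check whether a host is exposed to this condition, it can help to look for process table lines with an unusually large number of words. The sketch below does this with shell built-ins only, so it does not depend on awk itself. The function name count_wide_records is hypothetical, and the default threshold of 99 is an assumed placeholder rather than a documented NetBackup or Solaris limit; pass the limit that applies to the awk in use.

```shell
#!/bin/sh
# Hypothetical helper: read process-table lines on stdin and report any
# record whose whitespace-separated word count exceeds a limit.
# $1 = field limit (assumed default 99; not a documented value).
count_wide_records() {
    limit=${1:-99}
    n=0
    while IFS= read -r line; do
        n=$((n + 1))
        set -f          # disable globbing so words like '*' stay literal
        set -- $line    # split the line into words using the default IFS
        set +f
        if [ "$#" -gt "$limit" ]; then
            printf 'record %d has %d fields\n' "$n" "$#"
        fi
    done
}
```

For example, piping the output of a full-width process listing into count_wide_records 99 prints one line per oversized record, using the same record numbering style as the awk error above. Which ps invocation the NetBackup scripts actually use is not stated in this article, so the listing command to feed in is an assumption.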

Resolution

The inability to shut down NetBackup cleanly has a negative impact on VCS cluster fail-over on Solaris.

A fix for this issue is available as an emergency engineering binary (EEB) from E-Track 4000186, which is applied to the master server. A formal resolution to this issue will be part of a future NetBackup release.

 

Issue/Introduction

The NetBackup (NB) Veritas Cluster Server (VCS) cluster resource shows ONLINE when NetBackup is down. Both the VCS GUI and hastatus -sum report NetBackup as ONLINE, but bpps shows that the NetBackup services are all down. This issue is known to affect NetBackup 8.1 through 8.2 master servers on Solaris, both clustered and standalone. A related failure may occur on Solaris media servers.