NetWorker – Random stalling

One of the things I’ve spent a lot of time with is EMC NetWorker (previously Legato NetWorker).

A vaguely common issue is for a process of some kind – backups, staging to tape, restores, etc. – to just stop making any progress, for no apparent reason.

Once you’ve checked off the common reasons – like making sure you haven’t run out of disk space or usable tapes – it seems like the only option is to restart NetWorker as a whole, losing any in-progress actions (even ones that are to devices that haven’t stalled).

I suspect that random underlying I/O issues can occasionally upset it, and it doesn’t quite recover. But, whatever. How do you make it recover a single device, without restarting the whole thing?

First up, get the PID of the main nsrd process. On Solaris, use ps -ef | grep nsrd; on Linux, ps auxw | grep nsrd.
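If you'd rather not eyeball the grep output (which also tends to match the grep itself), here is a minimal sketch that pulls out just the PID. The function name nsrd_pids is my own invention, and it assumes a ps that supports the POSIX -o option.

```shell
# Hypothetical helper: print the PID of every process whose command
# name is exactly "nsrd" (so nsrmmd, nsrexecd, and grep are skipped).
# Reads "PID COMMAND" pairs on stdin, as produced by:
#   ps -e -o pid= -o comm=
nsrd_pids() {
    awk '{
        # comm may be a full path on some systems; keep only the basename
        n = split($2, parts, "/")
        if (parts[n] == "nsrd") print $1
    }'
}

# Usage:
#   ps -e -o pid= -o comm= | nsrd_pids
```

If it prints more than one PID, check which one is the parent before going any further.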

Assuming the PID is 1234, you next need to run: dbgcommand -p 1234 PrintDevInfo
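Since daemon.raw is usually large already, it can help to note its size before running the command and then look only at what was appended. A rough sketch follows; new_bytes_since is a made-up helper name, and the two-second pause is a guess, not a documented delay.

```shell
# Hypothetical helper: print everything appended to FILE after byte OFFSET.
# Usage: new_bytes_since FILE OFFSET
new_bytes_since() {
    # tail -c +N starts output at byte N (1-based), hence the +1
    tail -c "+$(( $2 + 1 ))" "$1"
}

# Usage (1234 is the example nsrd PID from above):
#   before=$(wc -c < /nsr/logs/daemon.raw)
#   dbgcommand -p 1234 PrintDevInfo
#   sleep 2    # assumed to be long enough for the dump to land
#   new_bytes_since /nsr/logs/daemon.raw "$before"
```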

It should pretty quickly spit out a whole stack of debugging info to /nsr/logs/daemon.raw. It’s moderately complicated, but you should see that it’s a dump of the internal state of each device, including d_device (the *nix device or directory) and mm_number (the unique ID of the nsrmmd process for that device).

So – find the device you’re interested in, and find the mm_number for that device.
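To cut the dump down to the two fields that matter, a simple filter is enough. I wouldn't rely on the exact layout of the dump, so this sketch only matches on the two field names; device_fields is a hypothetical helper name.

```shell
# Hypothetical helper: keep only the lines of the dump that mention
# d_device or mm_number. Reads the dump on stdin; -n prefixes each
# match with its line number so neighbouring fields can be paired up.
device_fields() {
    grep -n -E 'd_device|mm_number'
}

# Usage:
#   device_fields < /nsr/logs/daemon.raw
```

The line numbers let you check that the d_device and mm_number you are reading belong to the same device entry.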

Get a list of your nsrmmd processes, e.g. ps -ef | grep nsrmmd or ps auxw | grep nsrmmd. If your mm_number is 5, then there will be a process nsrmmd -n 5.

Kill the process, and it should re-spawn by itself on further access.
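Matching the right nsrmmd by eye is easy to get wrong (5 versus 15, say), so here is a small sketch that picks the PID by its -n argument. mmd_pid is a made-up name, and it assumes an awk that understands -v; on older Solaris boxes that probably means nawk or /usr/xpg4/bin/awk rather than the default awk.

```shell
# Hypothetical helper: print the PID of the nsrmmd process started
# with "-n NUMBER". Reads "PID ARGS..." lines on stdin, as produced by:
#   ps -e -o pid= -o args=
# Usage: mmd_pid NUMBER
mmd_pid() {
    awk -v num="$1" '{
        # Only consider processes whose executable is nsrmmd
        n = split($2, parts, "/")
        if (parts[n] != "nsrmmd") next
        # Look for "-n" followed by exactly the requested number
        for (i = 3; i < NF; i++)
            if ($i == "-n" && $(i + 1) == num) { print $1; next }
    }'
}

# Usage (5 is the example mm_number from above):
#   pid=$(ps -e -o pid= -o args= | mmd_pid 5)
#   [ -n "$pid" ] && kill "$pid"
```

The kill is deliberately guarded so that nothing happens if no matching process is found.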