Appendix B - Using SNMP with Netmail Store


The Netmail Store SNMP agent implementation allows you to monitor the health of cluster nodes, collect usage data, and control node actions. This appendix describes how to integrate a Netmail Store cluster into an enterprise SNMP monitoring infrastructure.


SNMP MIB Reference

If you boot from a CSN, see the following MIBs located at /usr/share/snmp/mibs:

  • CASTOR-MGR-MIB.txt. An aggregate MIB for all cluster nodes.
  • CARINGO-CASTOR-MIB.txt. A standard Netmail Store hardware MIB provided with the Netmail Store SNMP agent.

If you do not boot from a CSN, see the CARINGO-CASTOR-MIB.txt MIB located in the root directory of the Netmail Store software distribution.

Netmail Store now allows you to access the standard hardware MIBs distributed with the Net-SNMP package. These MIBs provide hardware reporting for areas such as processor load, memory availability, and network bandwidth. For complete information on the available OIDs, see the Net-SNMP MIB documentation.

Managing Netmail Store Nodes

Netmail Store cluster nodes are controlled through the SNMP action commands. The following OIDs enable you to disable nodes, and volumes within nodes, in a Netmail Store cluster:

  • castorShutdownAction. Disable nodes and volumes within nodes for servicing.
  • castorRetireAction. Disable nodes and volumes within nodes for retirement.

Shutdown Action for Nodes

To gracefully shut down a Netmail Store node, the string “shutdown” is written to the castorShutdownAction OID. Similarly, writing the string “reboot” to this OID causes a Netmail Store node to reboot.

When a node receives a shutdown or reboot action, it initiates a graceful stop by unmounting all of its volumes and removing itself from the cluster. For a shutdown, the node is powered off if the hardware supports this action. For a reboot, the node reboots the machine, re-reads the node or cluster configuration files, and starts up Netmail Store.

A graceful shutdown is required to perform a quick reboot. Performing an ungraceful shutdown forces the node to perform consistency checks on all its volumes before it can rejoin the cluster.

Before shutting down or rebooting a node, check the node status page or the SNMP castorErrTable OID for critical error messages. Any logged critical messages will be cleared upon reboot.

Retire Action for Nodes and Volumes

The Retire action is used to permanently remove a node or a volume within a node from the cluster. This action is intended for retiring legacy hardware or pre-emptively pushing content away from a volume with a history of I/O errors. Retired volumes and nodes are visible in the Netmail Store Admin Console until the cluster is rebooted. See Retiring Volumes for more information about retiring volumes.

Note: The Retire action may take an extended amount of time to complete and requires at least three health processor cycles.

Single Volumes

When a volume is retired, all of its stored objects are moved to other nodes in the Netmail Store cluster. After you initiate a volume retirement, the volume becomes a read-only volume and no additional objects can be stored on it. After all of the objects are moved to other locations in the cluster, the volume is idled with no further read/write requests. Each volume is given a unique name within its node – the device string from the vols line in the configuration file. To retire a volume, its name is written as a string to the castorRetireAction OID. The volume retirement process is initiated immediately upon receipt and the action cannot be aborted after it starts.

To manually retire a volume using the Netmail Store Admin Console, click the targeted node IP address in the console interface. In the Actions column, click Retire next to the targeted volume.

Entire Node

Retiring a node means all volumes on the node are retired at the same time. After all volumes in the node are retired and the node data is copied elsewhere in the cluster, the node is permanently out of service and will not respond to further requests.

To retire a node and all of its volumes, the all string is written to the castorRetireAction OID. The node retirement process is initiated immediately upon receipt and the action cannot be aborted after it starts.

Warning: Ensure that the cluster has both enough free space and enough nodes to store the objects on a retiring volume. When subclusters are in use, these requirements apply to the subcluster where the retiring volume resides. If the nodes in the cluster or subcluster do not have enough space to store at least two replicas of all objects, the retiring node cannot complete the retirement process until you add additional nodes. The Retire action does not require that the configured minreps be maintained to complete retirement. If there are not enough nodes to maintain minreps, the Retire action logs messages stating that sufficient replicas cannot be created.
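Because a retirement cannot be aborted after it starts, it can help to print the exact command for review before running it. The following is a minimal sketch of that idea and not part of the product: retire_cmd is a hypothetical helper, the community string and MIB path are the placeholder values used in the examples in this appendix, and the s (string) type mirrors the castorShutdownAction example.

```shell
# Hypothetical dry-run helper: prints the snmpset command that would retire
# a volume (by its vols device name) or an entire node ("all").
# It only prints; nothing is sent to the node.
retire_cmd() {
    node="$1"
    target="$2"
    if [ -z "$node" ] || [ -z "$target" ]; then
        echo "usage: retire_cmd NODE_IP VOLUME_NAME|all" >&2
        return 2
    fi
    # Placeholder community string and MIB path; substitute your own.
    printf 'snmpset -v 2c -c %s -m %s %s castorRetireAction s %s\n' \
        "read-write-password" "+./CARINGO-CASTOR-MIB.txt" "$node" "$target"
}
```

For example, retire_cmd 172.16.0.32 all prints the command that would retire the whole node. Passing a volume name taken from the vols line instead (such as a hypothetical /dev/sda) prints the command for a single volume.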

SNMP Tools and Monitoring Systems

Any standard SNMP query tool or monitoring system can be used to interact with Netmail Store. The examples in this section use the open source Net-SNMP (formerly UCD-SNMP) package that is available for UNIX and Microsoft Windows® platforms. Before using most tools and monitoring packages, install the Netmail Store MIB definition file. See the instructions included with the tool or package for more information.

Open Source Tools

The following tools can be useful to monitor and manage Netmail Store. Netmail does not endorse the applicability nor the fitness of these products when used within any environment.

  • Nagios (http://www.nagios.org). Provides a web-based monitoring system for UNIX environments that can monitor systems and send alerts through email and pager.

SNMP Examples with Netmail Store

Before you use the examples in this section, perform the following procedures:

  • Record the IP address of a Netmail Store cluster node. If the cluster is not in your subnet, record the address of the SCSP Proxy.

In the examples below, the node's IP address is 172.16.0.32.

  • Run the command from the directory that contains CARINGO-CASTOR-MIB.txt.

For example, copy CARINGO-CASTOR-MIB.txt from the root directory of the USB flash drive or distribution to a local directory.

  • Record the following passwords:
    • read-only-password. The password value for the snmp user defined in the security.operators configuration variable. By default, this password is public.
    • read-write-password. The password value for the snmp user defined in the security.administrators configuration variable. By default, this password is ourpwdofchoicehere.

For more information, see “Managing Netmail Store Administrators and Users”.

The following example shows an SNMP walk of all the Netmail Store values on a node.

snmpwalk -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 caringo

The following example shows a request for a specific SNMP variable from a Netmail Store node.

snmpget -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 reads

The following example shows a set request that will shut down a Netmail Store node.

snmpset -v 2c -c read-write-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 castorShutdownAction s shutdown

CARINGO-CASTOR-MIB::castorShutdownAction = STRING: "shutdown"

The following example shows a set request that changes the cluster's sleepAfter setting to 7260 seconds (121 minutes).

snmpset -v2c -c read-write-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 sleepAfter i 7260
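When these reads are scripted, a small wrapper keeps the connection options in one place. The sketch below is illustrative only: node_get and the NODE, COMMUNITY, and MIB variables are hypothetical names, and their defaults are the placeholder values from the examples above. It relies on the Net-SNMP -Oqv output option, which prints only the value.

```shell
# Placeholder defaults taken from the examples above; override as needed.
NODE="${NODE:-172.16.0.32}"
COMMUNITY="${COMMUNITY:-read-only-password}"
MIB="${MIB:-+./CARINGO-CASTOR-MIB.txt}"

# Hypothetical helper: read one value from the node.
# -Oqv asks Net-SNMP to print only the value, without the OID name or type.
node_get() {
    snmpget -v 2c -c "$COMMUNITY" -m "$MIB" -Oqv "$NODE" "$1"
}
```

With this in place, node_get reads returns the same value as the snmpget example above, in a form that is easier to capture in a shell variable.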

SNMP Action OIDs

The “action” OIDs in Netmail Store are the SNMP objects that affect the operation of a node or the cluster. To prevent conflicts for cluster-level parameters such as volumeRecoverySuspend, write the action to a single node only, so that updates to the persisted settings UUID originate from one node.

castorFeedRestartAction

Writing to this object allows you to restart a feed on a node using SNMP. When you set the OID value to a specific feed value, the feed restarts on all nodes in the cluster.

The castorFeedTable OID allows you to view the Netmail Store feed information for a specific node. Each entry indicates a feed running on the selected node. The Netmail Store Admin Console allows you to view the SNMP Repository Dump page, which provides node-specific information. See “SNMP Repository Dump” for more information.

castorLogLevelAction

Writing to this object allows you to change the logging level. When a node is booted, it sets the logging level based on the loglevel parameter. You can increase the logging level to debug an issue and then return it to its previous value when completed.

castorRetireAction

Writing to this object allows you to remove the contents of a disk volume or an entire node in an orderly fashion. Instead of removing disks, consider retiring disks to save content that may not be saved on another disk. The device name from the node configuration vols parameter or the all string is written to this OID. You can simultaneously retire volumes from multiple nodes in the cluster.

castorShutdownAction

Writing to this object allows you to gracefully shut down or reboot a node or an entire cluster. The supported values are:

  • shutdown. Shuts down this node only.
  • reboot. Reboots this node only.
  • clustershutdown. Shuts down all nodes in the cluster.
  • clusterreboot. Reboots all nodes in the cluster.
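Since two of these values affect every node in the cluster, a script that writes to castorShutdownAction may want to check the value and its scope first. The following is a minimal sketch of such a check; shutdown_scope is a hypothetical helper name, and the value list is simply the four values listed above.

```shell
# Hypothetical guard: report whether a castorShutdownAction value affects
# one node or the whole cluster, and reject anything not listed above.
shutdown_scope() {
    case "$1" in
        shutdown|reboot)               echo "node" ;;
        clustershutdown|clusterreboot) echo "cluster" ;;
        *) echo "unsupported value: $1" >&2
           return 1 ;;
    esac
}
```

A calling script could, for instance, require an explicit confirmation whenever shutdown_scope prints cluster before it issues the snmpset command.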

castorSyslogHostAction

Writing to this object allows you to change the logging host for writing log messages. When a node is booted, it sets the logging host based on the loghost parameter. Additionally, you can redirect syslog messages to your workstation to debug an issue.

volumeRecoverySuspend

Writing to this object allows you to suspend volume recovery behavior in the cluster during an upgrade or a network outage.

Practical SNMP with Netmail Store

This section outlines some practical approaches in using the built-in SNMP agent to monitor the health and operational aspects of a Netmail Store cluster. Although you can set up a simple ICMP ping monitor of a Netmail Store node, using the SNMP variables provides detailed indications of disk and capacity problems.

Health Monitoring

The following variables can be used to monitor the basic health of a Netmail Store node. The volume table will have n from 1 to the number of disk volumes.

  • caringo.castor.castorState. Should equal “OK.”
  • caringo.castor.castorVolTable.volEntry.volState.n. Should equal “OK.”
  • caringo.castor.castorVolTable.volEntry.volErrors.n. Should be zero.

If the monitoring console receives timeouts when trying to read these variables, there is something wrong with the node. If the state values are anything other than “OK,” the node or the disks are transitioning from their normal state.

The valid states for a node are:

  • OK
  • Retiring
  • Retired

The valid states for a disk volume are:

  • OK
  • Retiring
  • Retired
  • Unavailable

Any non-zero value in the volume error count indicates that a hard error has surfaced from the disk hardware through the OS driver and to the Netmail Store process.
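The checks above map naturally onto the exit codes that Nagios-style plugins use (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows one possible mapping, not a supported plugin: check_health is a hypothetical function, the choice to treat Retiring and Retired as warnings is an assumption, and it assumes the state values arrive as the plain strings listed above (compared without regard to case).

```shell
# Hypothetical classifier. Arguments: castorState, volState, volErrors.
# Prints a one-line status and returns a Nagios-style exit code.
check_health() {
    node_state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    vol_state=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    vol_errors="$3"

    # A missing or non-numeric error count cannot be judged.
    case "$vol_errors" in
        ''|*[!0-9]*) echo "UNKNOWN: volErrors='$vol_errors'"; return 3 ;;
    esac

    # Hard disk errors or an unavailable volume need immediate attention.
    if [ "$vol_errors" -gt 0 ] || [ "$vol_state" = "unavailable" ]; then
        echo "CRITICAL: volState=$2 volErrors=$vol_errors"; return 2
    fi
    if [ "$node_state" = "ok" ] && [ "$vol_state" = "ok" ]; then
        echo "OK: node and volume report OK"; return 0
    fi
    # Assumption: retirement is expected but worth flagging.
    case "$node_state:$vol_state" in
        retiring:*|retired:*|*:retiring|*:retired)
            echo "WARNING: castorState=$1 volState=$2"; return 1 ;;
    esac
    echo "UNKNOWN: castorState=$1 volState=$2"; return 3
}
```

The three arguments would come from reading castorState, volState.n, and volErrors.n for each volume; a read that times out is best reported by the caller as a failure in its own right.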

Capacity Monitoring

The following variables can be monitored and collected for capacity alerting and reporting. The volume table will have n from 1 to the number of disk volumes.

  • caringo.castor.castorFreeSlots. Should be greater than zero.
  • caringo.castor.castorVolTable.volEntry.volMaxMbytes.n
  • caringo.castor.castorVolTable.volEntry.volFreeMbytes.n
  • caringo.castor.castorVolTable.volEntry.volTrappedMbytes.n

The castorFreeSlots variable indicates how many more objects a node can hold before it exhausts its memory index. If this occurs, the node is unable to store additional objects until objects are deleted or moved to other cluster nodes (or more RAM is added to the node). The free slots value indicates how much RAM is required per object. For more information, see the section on memory effects on node storage in the chapter on hardware considerations in the Netmail Store Getting Started Guide.

To compute the amount of disk space that is available for writing content, add the values volFreeMbytes and volTrappedMbytes. Thus, the percent free space on a disk volume is:

100 * (volFreeMbytes + volTrappedMbytes) / volMaxMbytes

Similarly, the percent of space being used by current content is:

100 * volUsedMbytes / volMaxMbytes

These disk usage variables can be totaled for all disk volumes in a node and all nodes in a cluster to produce capacity utilization reports.
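As a worked illustration of the two capacity formulas, the sketch below computes both percentages for one volume with awk. vol_capacity is a hypothetical helper, and the numbers in the usage note are made-up sample values rather than output from a real node.

```shell
# Hypothetical helper.
# Arguments: volFreeMbytes volTrappedMbytes volUsedMbytes volMaxMbytes
# Prints percent free (free + trapped) and percent used for one volume.
vol_capacity() {
    awk -v free="$1" -v trapped="$2" -v used="$3" -v max="$4" 'BEGIN {
        if (max <= 0) {
            print "volMaxMbytes must be positive" > "/dev/stderr"
            exit 1
        }
        printf "free=%.1f%% used=%.1f%%\n", \
            100 * (free + trapped) / max, 100 * used / max
    }'
}
```

For a volume with 300 MB free, 100 MB trapped, 600 MB used, and a maximum of 1000 MB, vol_capacity 300 100 600 1000 prints free=40.0% used=60.0%.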

Client Activity Reporting

You can collect and report the amount of client activity received by the nodes to understand end-user usage patterns and to identify nodes that may be receiving significantly more activity than others. Such an imbalance can indicate a poor primary access node selection mechanism in the client application code.

The following SNMP variables indicate client request activity on a Netmail Store node.

  • caringo.castor.scsp.writes
  • caringo.castor.scsp.reads
  • caringo.castor.scsp.infos
  • caringo.castor.scsp.deletes
  • caringo.castor.scsp.errors
  • caringo.castor.scsp.updates
  • caringo.castor.scsp.copies
  • caringo.castor.scsp.appends
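One way to spot an uneven distribution is to compare each node's share of a given counter. The sketch below assumes you have already collected one counter (for example, reads) from every node into lines of the form "node-address count"; activity_share is a hypothetical helper, and treating the counters as directly comparable across nodes (for example, sampled over the same period) is an assumption.

```shell
# Hypothetical helper: read "node count" pairs on stdin and print each
# node's percentage share of the total. Prints nothing if the total is zero.
activity_share() {
    awk '
        NF >= 2 { n++; node[n] = $1; count[n] = $2; total += $2 }
        END {
            if (total == 0) exit
            for (i = 1; i <= n; i++)
                printf "%s %.1f%%\n", node[i], 100 * count[i] / total
        }'
}
```

Feeding it the two sample lines "172.16.0.32 750" and "172.16.0.33 250" (the second address and both counts are invented for illustration) prints 75.0% for the first node and 25.0% for the second.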

SNMP Repository Dump

The SNMP Repository Dump page provides node-specific information that is not available in the Netmail Store Admin Console.

To access the SNMP Repository Dump page for a cluster node:

1. Open the Netmail Store Admin Console.

2. In the Node IP column, click the IP address of the target node.

3. Scroll down and maximize Node Info.

4. Scroll down and click SNMP Repository.
