There are several actions that an administrator can perform on a volume in response to irregular situations that may arise.
The Netmail Store cluster is designed to automatically adapt in the event of a failed volume (hard disk) or a failed node. Every volume within the Netmail Store cluster is checked during the startup of a node. If a volume has been disconnected from the cluster for more than 14 days, it is deemed stale and its contents will not be used unless an administrator specifically overrides this behavior.
Although the 14-day time limit applies to individual volumes, it also means that if a node is shut down for more than 14 days, all of its volumes will be considered stale and will not be used. After 14 days, an administrator may force a volume to be remounted by modifying the volume specification and adding the “keep” policy option. For details about how this is done, see Volume Specifications.
When a volume that has been disconnected for more than 14 days is forced to return to service, care must be taken because this may resurrect content that had been explicitly deleted through client requests. This is not a problem for content that was deleted through automatic life-point policies, because such content will be discovered and deleted by Netmail Store’s continuous health checking process.
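The stale-volume rule described above can be sketched as a small decision function. This is only an illustration of the stated policy, not Netmail Store's implementation: the function name `should_mount`, its parameters, and the constant `STALE_AFTER` are hypothetical, and the `keep` flag simply stands in for the “keep” policy option in the volume specification.

```python
from datetime import datetime, timedelta

# The 14-day limit comes from the text above; all names here are hypothetical.
STALE_AFTER = timedelta(days=14)


def should_mount(last_seen: datetime, now: datetime, keep: bool = False) -> bool:
    """Return True if a volume's contents would be used at node startup.

    last_seen: when the volume was last connected to the cluster.
    keep: True if the administrator added the "keep" policy option.
    """
    # A volume is stale only if it has been away for MORE than 14 days.
    is_stale = (now - last_seen) > STALE_AFTER
    # The "keep" override forces a stale volume back into service.
    return keep or not is_stale
```

For example, a volume last seen 30 days ago is rejected by default and accepted only when `keep=True`, which is the case where previously deleted content may reappear.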
Movement Between Nodes
Physical volumes can be moved between nodes if this becomes necessary due to hardware failures or other constraints as determined by an administrator. When a volume goes offline due to a failure of the volume, the failure of the node, or the shutdown of a node, the cluster will immediately begin the process of ensuring that the correct number of replicas exists for all the streams in the cluster. If a volume or node returns to the cluster during this operation and prior to the 14-day time limit, the checks will continue, but the replicas on the returned volumes will be considered when validating the stream constraints.
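To illustrate how a returning volume affects replica validation, the following minimal sketch counts the replicas of a single stream that sit on volumes currently online. It is an assumption-laden model rather than the cluster's actual algorithm: the function `replica_deficit` and the volume identifiers are hypothetical.

```python
def replica_deficit(replica_volumes, online_volumes, required_replicas):
    """Return how many additional replicas one stream still needs.

    replica_volumes: volumes that hold a replica of the stream.
    online_volumes: volumes currently mounted somewhere in the cluster.
    required_replicas: the replica count the stream's constraints call for.
    """
    # Only replicas on online volumes count toward the constraint.
    live = sum(1 for volume in replica_volumes if volume in online_volumes)
    return max(0, required_replicas - live)
```

With replicas on `vol-a` and `vol-b` and two replicas required, the deficit is 1 while `vol-b` is offline and drops back to 0 as soon as `vol-b` returns within the time limit.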
Warning: When adding volumes to a node, whether they are new or moved from another machine, ensure that the node has sufficient RAM to handle the additional storage. If the RAM is not sufficient, the node may be unable to mount some of the volumes.
Volumes may also be moved to nodes that are in a different cluster. When this is done, the streams on that volume become part of the new cluster and they will be checked for the correct constraints within the context of the new cluster.
In order to provide for autonomous operations, a Netmail Store node watches for physical errors when reading and writing to its volumes. If the node receives any physical errors from a volume, the volume is immediately retired and the node will avoid any further requests to the failed device.
Modern disk storage devices and interfaces are sophisticated: the underlying disk system already performs many error detection steps, bad sector re-mapping, and retry attempts. If a physical error nevertheless propagates up to the Netmail Store software level, there is little chance that a deterministic set of steps can be performed to work around the failure. Additionally, there is no guarantee that the extent of the error can be isolated or that the continued use of the failing device will allow the node to continue to operate normally with its other storage devices. For these reasons, Netmail Store takes the conservative approach of retiring a device upon receipt of any physical error. If a configurable number of additional errors is received while the volume is being retired, the volume will be forced offline. Please refer to Node Configuration for details about the 'ioErrorTolerance' parameter.
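The retire-then-offline behavior described above can be modeled as a small state machine. This is a hedged sketch only: the class `VolumeErrorTracker`, the state names, and the choice to force the volume offline when the count of additional errors reaches the tolerance (rather than exceeds it) are assumptions. The exact semantics of 'ioErrorTolerance' are described in Node Configuration.

```python
from enum import Enum


class VolumeState(Enum):
    ONLINE = "online"
    RETIRING = "retiring"
    OFFLINE = "offline"


class VolumeErrorTracker:
    """Hypothetical model of how a node reacts to physical I/O errors."""

    def __init__(self, io_error_tolerance: int):
        # Stands in for the 'ioErrorTolerance' node configuration parameter.
        self.io_error_tolerance = io_error_tolerance
        self.state = VolumeState.ONLINE
        self.errors_during_retire = 0

    def record_physical_error(self) -> VolumeState:
        if self.state is VolumeState.ONLINE:
            # Any physical error immediately starts the retire.
            self.state = VolumeState.RETIRING
        elif self.state is VolumeState.RETIRING:
            # Further errors during the retire count against the tolerance.
            self.errors_during_retire += 1
            if self.errors_during_retire >= self.io_error_tolerance:
                self.state = VolumeState.OFFLINE
        # Errors reported after the volume is offline change nothing.
        return self.state
```

With a tolerance of 2, the first error moves the volume to `RETIRING`, the next error leaves it there, and the one after that forces it to `OFFLINE`.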