Appendix D - Troubleshooting Netmail Store


This appendix provides information to help you troubleshoot and resolve issues in your Netmail Store storage cluster.


Restoring Domains and Buckets

This section describes how to recover deleted domains or buckets in your Netmail Store cluster.

Recovering a Deleted Domain

To recover a domain, you will need the name of a bucket that was contained in the domain. The following procedure shows how to create the domain from the command line using the previous domain's UUID.

If you create a domain with an identical name that is currently in the Admin Console, the new domain is mapped to a different UUID. Since the buckets in the deleted domain reference the deleted domain UUID (for example, Castor-System-CID), you cannot access the buckets unless you set the new domain's UUID to the previous value. Additionally, you will need the Castor-Authorization header corresponding to the domain's protection setting, as shown in the table below:

Protection Setting → Castor-Authorization Header

All users. No authentication required.

Castor-Authorization: domain-name/_administrators, POST=

Only users in this domain

Castor-Authorization: domain-name/_administrators, POST=domain-name

Only users in domain domain-name

Castor-Authorization: domain-name/_administrators, POST=domain-name

(The difference between this protection setting and the preceding one is that in this case, domain-name is the name of a different domain in the cluster.)

To recover the domain:

1. Add a Content Router filter rule to search for streams where the value of the Castor-System-Name header is a bucket in the domain.

2. Using the SDK, instantiate a metadata enumerator subscribed to the rule channel you created in the preceding step to obtain the bucket's metadata.

3. In the returned object metadata, search for the Castor-System-CID header value. The Castor-System-CID header is the UUID of the domain that contained the bucket.

4. POST the previous domain's UUID using the recreatecid query argument to create the new domain, passing in the Castor-Authorization, Castor-Stream-Type, and lifepoint headers exactly as shown below.

curl -i -X POST -H 'Castor-Authorization: protection-setting' -H 'Castor-Stream-Type: admin' -H 'lifepoint: [] reps=16' --data-binary '' --post301 --location-trusted 'http://node-ip?domain=domain-name&admin&recreatecid=previous-domain-UUID' --anyauth -u 'your-username:your-password' [-D log-file-name]

Note: The lifepoint and Castor-Stream-Type headers must be entered exactly as shown above to match the headers used when domains are created by the Admin Console. lifepoint: [] reps=16 enables the domain to be replicated as many times as possible. Castor-Stream-Type: admin is recommended for all objects that use a Castor-Authorization header.

For example, if the domain name is cluster.example.com with the protection setting Only users in this domain, and the old domain alias was c0d0fa42bccac73cd3f2324bb53e40a5, enter the following command:

curl -i -X POST -H 'Cache-Control: no-cache-context' -H 'Castor-Authorization: cluster.example.com/_administrators, POST=cluster.example.com' -H 'Castor-Stream-Type: admin' -H 'lifepoint: [] reps=16' --data-binary '' --post301 --location-trusted 'http://172.16.0.35?domain=cluster.example.com&admin&recreatecid=c0d0fa42bccac73cd3f2324bb53e40a5' --anyauth -u 'admin:ourpwdofchoicehere'
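After the POST succeeds, you can confirm that the recreated domain carries the previous UUID with a HEAD request. The sketch below only builds the command; the node IP, domain name, UUID, and credentials are the same placeholder values used in the example above:

```shell
# Sketch: build a HEAD request that verifies the recreated domain.
# All values below are placeholders from the example above.
node_ip="172.16.0.35"
domain="cluster.example.com"
expected_uuid="c0d0fa42bccac73cd3f2324bb53e40a5"

# Run this command against your cluster and check that the response
# contains the header: Castor-System-Alias: $expected_uuid
verify_cmd="curl -I --anyauth -u 'admin:ourpwdofchoicehere' --location-trusted 'http://${node_ip}?domain=${domain}&admin'"
echo "$verify_cmd"
```

If the Castor-System-Alias value in the response does not match the old UUID, the recreatecid argument was not applied and the buckets will remain inaccessible.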

5.  Create the _administrators bucket for the domain.

curl -i -X POST -H 'Cache-Control: no-cache-context' -H 'Castor-Authorization: domain-name/_administrators' -H 'Castor-Stream-Type: admin' -H 'lifepoint: [] reps=16' --data-binary '' --post301 --location-trusted 'http://node-ip/_administrators?domain=domain-name&admin' --anyauth -u 'your-username:your-password' [-D log-file-name]

Note: The following error indicates you omitted --post301 from the command: CAStor Error Content-Length header is required

For example, to create the cluster.example.com/_administrators bucket:

curl -i -X POST -H 'Cache-Control: no-cache-context' -H 'Castor-Authorization: cluster.example.com/_administrators' -H 'Castor-Stream-Type: admin' -H 'lifepoint: [] reps=16' --data-binary '' --post301 --location-trusted 'http://172.16.0.35/_administrators?domain=cluster.example.com&admin' --anyauth -u 'admin:ourpwdofchoicehere'

6. To verify the procedure, start the Admin Console.

7. On the Cluster Status page, click Settings.

8. In the Cluster Tenants section, ensure that the domain name and protection setting display correctly.

9. Click Edit next to the name of the domain you just restored.

10. Click Add Domain Manager.

11. Follow the prompts on your screen to create a domain manager. If you can add a domain manager, the domain was restored successfully. The Admin Console may display an alert about the domain not having an _administrators bucket.

12. Optional. Return to the Cluster Settings page and click the IP address of any node with a red Alert message that is similar to the following:

Error reading admin bucket 'cluster.example.com/_administrators' ([Errno 2] Bucket not found)

13. Click Clear Errors to confirm.

Recovering a Deleted Bucket

When you delete a bucket, the objects in the bucket are not deleted, but they are inaccessible until you recover the bucket.

To recover a deleted domain or bucket, locate:

  • The name of a child object.

For example, if a bucket was deleted, you must know the name of an object contained in that bucket.

  • Access to the Content Router product. Use the Content Router's metadata enumerator to find the contained object's metadata. The metadata enumerator iterates through all of the cluster's objects and returns information about them.

For example, if a bucket was deleted, the metadata enumerator cannot locate the bucket, but it can locate objects contained in the bucket because the objects were not deleted. Knowing the name of an object, you can locate the bucket's UUID, which you can use to recover the bucket.

In the following procedure, assume that an application developer notifies you that the following objects are not accessible:

photo1.jpg, photo2.jpg, photo3.jpg

You do not know the name of the bucket in which the objects were contained.

To recover the bucket:

1. Add a Content Router filter rule to search for streams where the value of the Castor-System-Name header is one of photo1.jpg, photo2.jpg, or photo3.jpg.

2. Using the SDK, instantiate a metadata enumerator subscribed to the rule channel you created in the preceding step to obtain the object's metadata.

3. In the returned object metadata, search for the Castor-System-CID header value. The Castor-System-CID header is the UUID of the bucket that contained the object.

4. After you find the bucket's UUID, use the following command to recover it:

curl -i -X POST --post301 --anyauth -u 'cluster-administrator-username:password' --data-binary @bucket --location-trusted 'http://node-ip/bucket-name?domain=domain-name&admin&recreatecid=alias-uuid'

You must provide the domain name or IP address as the Host in the request.

For example, to recover a bucket named mybucket with the alias UUID 75edd708dc250137849bbf590458d401 in the domain named cluster.example.com, enter:

curl -i --post301 --anyauth -u 'admin:ourpwdofchoicehere' -X POST --location-trusted 'http://172.16.0.35/mybucket?domain=cluster.example.com&admin&recreatecid=75edd708dc250137849bbf590458d401'
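Once the bucket is recreated, the objects it contained should be reachable again. The sketch below builds a check against one of the objects named earlier; the node IP, domain, bucket, and object names are the placeholders from the example above:

```shell
# Sketch: check that an object in the recovered bucket is reachable.
# All values below are placeholders from the recovery example above.
node_ip="172.16.0.35"
domain="cluster.example.com"
bucket="mybucket"
object="photo1.jpg"

# Run this command against your cluster. An HTTP 200 response confirms
# the recovery; a 404 suggests the recreatecid value did not match the
# bucket's original alias UUID.
check_cmd="curl -I --location-trusted 'http://${node_ip}/${bucket}/${object}?domain=${domain}'"
echo "$check_cmd"
```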

Resolving Duplicate Domain Names in a Mirrored or Disaster Recovery Cluster

This section discusses how to resolve duplicate domain names in a mirrored or disaster recovery (DR) cluster.

Using Content Router, you can create two types of DR cluster configurations:

  • DR Cluster. Copies one or more clusters and their contents in another physical location.
  • Mirrored Configuration. Copies the contents of cluster 1 to cluster 2, and copies the contents of cluster 2 to cluster 1.

In either type of configuration, if two source clusters contain two domains with the same name, Content Router duplicates the domain names in the DR or mirrored cluster. This results in indeterminate access to objects in the duplicated domains. Sometimes a request to a particular object in one of the duplicate domains succeeds, but other times it fails.

When Content Router detects a duplicate domain, it logs a Critical error to its Netmail Store Admin Console. If you receive a Critical error, perform one of the following procedures:

  • Rename either domain in its source cluster (recommended for a DR cluster conflict). This method resolves the issue and prevents it from happening in the future. You can perform this task using the Admin Console in the source cluster.

Note: This method does not work in a mirrored configuration because both clusters have duplicates. In this situation, use the next procedure.

  • Rename either conflicting domain in the DR or mirrored cluster itself. This is the only method you can use in a mirrored cluster conflict. It resolves the issue and prevents it from recurring. For a DR cluster conflict, this method is not recommended because the next time the same domain is replicated to the DR cluster, the duplicate domain name will exist again.

Renaming a Domain in its Source Cluster (DR Cluster Conflict Only)

This section discusses how to rename a domain in its source cluster, where the name of the domain is assumed to be unique. After you rename the domain, it replicates without errors to the DR cluster. To resolve a conflict in a mirrored configuration, refer to Renaming a Domain in a Mirrored or DR Cluster.

To rename a domain in the source cluster of a DR cluster:

1. Start the Admin Console.

2. On the first page, click Settings.

3. On the Cluster Settings page, click Edit next to the name of the domain you want to rename.

4. In the Add Cluster Tenant section, enter a new name in the Domain Name field.

5. Click Save.

6. If prompted, enter an administrator user name and password.

Renaming a Domain in a Mirrored or DR Cluster

To rename a domain in a mirrored or DR cluster, use an SCSP COPY command with the following query arguments, and authenticate as a cluster administrator.

Query Argument → Meaning
admin

Also referred to as administrative override, this query argument enables you to ignore Allow headers and bypass the Castor-Authorization header.

This query argument requires your cluster administrator credentials. Note that administrative override does not impact lifepoint policy deletability for immutable objects.

newname=new-domain-name

The new name for the domain. Make sure you use a name that follows the rules discussed in “Rules and Recommendations for Managing Tenants.”

aliasuuid=domain-UUID

The UUID of the domain to rename.

You can find the UUID using a HEAD on the domain in its source cluster, as discussed in the example following the table.

To rename a domain in a mirrored or DR cluster:

1. Obtain the alias UUID of the domain to rename.

Use the SCSP INFO command as follows:

INFO /?domain=domain-name&admin
Host: domain-name-or-ip

You must authenticate as a cluster administrator (that is, a user in the security.administrators parameter).

The value of the Castor-System-Alias header is the domain's UUID.

You must also get the value of the Castor-Authorization header from the HEAD request and pass it in, updated with the new domain name, with the rename command as shown in the next step.

2. Rename the domain.

curl -X COPY -H 'Castor-Authorization: renamed-value-from-HEAD' -H 'lifepoint: [] reps=16' -H 'Castor-Stream-Type: admin' --anyauth -u 'cluster-administrator-username:password' --location-trusted 'http://node-ip?domain=domain-name&admin&aliasuuid=uuid&newname=new-domain-name'

For example, to rename cluster.example.com to archive.example.com by sending commands to a node whose IP address is 172.16.0.35:

1. HEAD the domain to get its alias UUID:

curl -I --anyauth -u 'admin:ourpwdofchoicehere' --location-trusted
'http://172.16.0.35?domain=cluster.example.com&admin'

Sample output follows:

HTTP/1.1 200 OK
Cache-Control: no-cache-context
Castor-Authorization: cluster.example.com/_administrators, POST=cluster.example.com
Castor-Stream-Type: admin
Castor-System-Alias: bbc2365b3283c23c47595abcfd09034a
Castor-System-CID: ffffffffffffffffffffffffffffffff
Castor-System-Cluster: cluster.example.com
Castor-System-Created: Wed, 17 Nov 2010 15:59:13 GMT
Castor-System-Name: cluster.example.com
Castor-System-Owner: admin@CAStor administrator
Castor-System-Version: 1290009553.775
Content-Length: 0
Last-Modified: Wed, 17 Nov 2010 15:59:13 GMT
lifepoint: [] reps=16
Etag: "099e2bc25eb8346ed5d94a598fa73bfa"
Date: Wed, 17 Nov 2010 16:02:07 GMT
Server: CAStor Cluster/5.0.0

The information you need to rename the domain is:

Castor-Authorization: cluster.example.com/_administrators, POST=cluster.example.com

You must change this header to Castor-Authorization: archive.example.com/_administrators, POST=archive.example.com.

Castor-System-Alias: bbc2365b3283c23c47595abcfd09034a

You must also add the following headers exactly as shown:

    • -H 'Castor-Stream-Type: admin'
    • -H 'lifepoint: [] reps=16'

Note: The lifepoint and Castor-Stream-Type headers must be entered exactly as shown to match the headers used when domains are created by the Admin Console. lifepoint: [] reps=16 enables the domain to be replicated as many times as possible. Castor-Stream-Type: admin is recommended for all objects that use a Castor-Authorization header.

2. Rename the domain.

curl -i -X COPY -H 'Castor-Authorization: archive.example.com/_administrators, POST=archive.example.com' -H 'Castor-Stream-Type: admin' -H 'lifepoint: [] reps=16' --anyauth -u 'admin:ourpwdofchoicehere' --location-trusted 'http://172.16.0.35?domain=cluster.example.com&admin&aliasuuid=bbc2365b3283c23c47595abcfd09034a&newname=archive.example.com' -D rename-domain.log

3. Verify the new domain name using the Admin Console.
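Alternatively, you can verify the rename from the command line with a HEAD request against the new name. The sketch below only builds the command; the node IP, domain name, and credentials are the placeholders from the example above:

```shell
# Sketch: build a HEAD request that verifies the renamed domain.
# All values below are placeholders from the example above.
node_ip="172.16.0.35"
new_domain="archive.example.com"

# Run this command against your cluster. A 200 OK response whose
# Castor-System-Name header reads archive.example.com confirms the
# rename; the Castor-System-Alias value should be unchanged.
verify_cmd="curl -I --anyauth -u 'admin:ourpwdofchoicehere' --location-trusted 'http://${node_ip}?domain=${new_domain}&admin'"
echo "$verify_cmd"
```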

Using Content Router to List Buckets and Objects

To optionally use Content Router to list the buckets in a domain or objects in a bucket:

1. Find the value of the Castor-System-CID for the child of an object to list.

For example, to list all buckets in a domain, INFO an object in the bucket to find the value of the object's Castor-System-CID header. (The Castor-System-CID of the object is the Castor-System-Alias of its parent, the bucket.)

2. Add a Content Router filter rule to search for streams where the value of the Castor-System-CID header matches the value in step 1 and the value of Castor-System-Alias is not null.

(The Castor-System-Alias of a named object is null.)

3. Using the SDK, instantiate a metadata enumerator subscribed to the rule channel you created in the preceding step to obtain the stream's metadata.

4. In the metadata returned for the object, look for the value of the Castor-System-Name header.

Boot Errors

Refer to the following table for help with boot errors.

Symptom → Resolution

1. When booting, the node generates an error stating that a boot device is not available.

2. The node boots into an operating system other than Netmail Store.

Resolution: If you are booting from a USB device, verify that the node is capable of booting from a USB device and that the USB memory device is configured as the primary boot device.

If you are PXE booting, ensure that:

  • The server is configured to network boot.
  • PortFast is configured on the switch ports that lead to the Netmail Store node.

Otherwise, the extended delay caused by the listening and learning Spanning Tree states can prevent netboot from delivering the Netmail Store image to the node in a timely manner.
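On Cisco IOS switches, for example, PortFast is enabled per interface. The fragment below is illustrative only; the interface name is a placeholder, and the exact commands vary by switch vendor and model:

```
! Illustrative Cisco IOS fragment (interface name is a placeholder):
interface GigabitEthernet0/1
 spanning-tree portfast
```

Consult your switch documentation for the equivalent setting on non-Cisco hardware.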

3. The node boots from the USB device, but Netmail Store fails to start.

4. The node begins to boot but reports a “kernel panic” error and stops.

Resolution: These symptoms usually indicate a hardware compatibility issue. Contact Support with the details of your hardware setup.


Configuration

Refer to the following table for help with configuration problems.

Symptom → Resolution

1. After the system boots, a message appears stating that the configuration file is missing.

2. The node boots, but storage is not available on the node.

3. A hard drive in a node does not appear as available storage.

4. After adding a new hard drive to a node, some of the volumes will not mount.

5. After moving a volume between nodes, some volumes in the new node will not mount.

Resolution: Ensure that each node has a node.cfg file on the USB stick and that all volumes within the node are specified in the vols option. See "Configuring the Nodes" for information about configuring a node.

A volume must be larger than the minimum size specified by the disk.minGB parameter (64 GB by default) or it will not mount. You can mount smaller disks by lowering the disk.minGB value.

If the vols specification is correct and the volume is larger than disk.minGB, the node may not have enough RAM. Check the available RAM and ensure that the node is sufficiently provisioned.
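As an illustration, a minimal node.cfg might specify the two settings discussed above. This fragment is a sketch only; the exact option syntax for your release is documented in "Configuring the Nodes":

```
# Illustrative node.cfg fragment (check "Configuring the Nodes" for the
# exact syntax in your release; device paths are placeholders):
vols = /dev/sda /dev/sdb
disk.minGB = 64
```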

6. The node boots from the USB device, but Netmail Store fails to start.

7. The node begins to boot but reports a “kernel panic” error and stops.

Resolution: These symptoms usually indicate a hardware compatibility issue. Contact Support with the details of your hardware setup.

8. Some changes to the node.cfg file disappear after editing.

Resolution: If a USB flash drive is removed from a computer without being unmounted, some changes can be lost. Use the proper method for your OS to stop and unmount the USB media before removing it.

9. The following alert displays in the Admin Console:

Local clock is out of sync with node ip-address

In addition, a clock icon displays next to each node where the error occurs, indicating a clock synchronization issue between the cluster nodes. The clock icon displays next to any node that sends a data packet that is more than three minutes offset from the reporting node's local clock. The local node also logs a critical error.

Resolution: Ensure that your Network Time Protocol (NTP) settings are correct and that the cluster nodes can access the configured NTP server.
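Assuming the standard ntpq utility is available on a machine that can reach your NTP server, a quick way to check synchronization is to list the peers. The sketch below only builds the command:

```shell
# Sketch: check NTP synchronization, assuming the standard ntpq utility
# from the NTP distribution is available.
ntp_check="ntpq -p"
echo "$ntp_check"
# In the output, the peer marked with '*' is the currently selected time
# source; large values in the offset column (milliseconds) indicate
# clock drift that could trigger the Admin Console alert.
```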

10. A node hangs during boot while initializing ACPI services.

Resolution: The system hardware is conflicting with the Advanced Configuration and Power Interface (ACPI). To resolve this issue, add the argument acpi=off to the syslinux.cfg file on the USB flash drive (for local booting) or to the PXE configuration file (for network booting).
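For reference, the argument is added to the append line of the boot entry. The fragment below is illustrative only; the label and file names are placeholders, and only the acpi=off addition matters:

```
# Illustrative syslinux.cfg fragment (label and file names are
# placeholders; add acpi=off to the existing append line):
default castor
label castor
  kernel kernel
  append initrd=initrd.gz acpi=off
```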

11. The Netmail Store node boots as having an unregistered license.

Resolution: The license file is not in the Caringo directory on the USB drive, or the licenseFileURL option in the node or cluster configuration file is not set properly.

Operational Problems

Refer to the following table for help with operational problems.

Symptom → Resolution

1. A volume device failed.

Resolution: Allow the node to continue running in a degraded state (with reduced storage), or replace the volume at your earliest convenience. See “Replacing Failed Drives” for information about replacing a volume.

2. A node failed.

Resolution: If a node fails but its volume storage devices are functioning properly, you can repair the hardware and return it to service within 14 days. If a node is down for more than 14 days, all of its volumes are considered stale and cannot be used. After 14 days, you can force a volume to be remounted by modifying the volume specification and adding the :k (keep) policy option.

3. In the Admin Console, all remaining cluster nodes are consistently or intermittently offline. If you view the Admin Console from another cluster node, the original node appears offline to the second node, and so on; each node appears as its own island from which no other nodes are reachable.

Resolution: If a new node cannot see the remaining nodes in the cluster, check the Netmail Store network configuration settings on each node (particularly the group parameter) to ensure that all nodes are configured as part of the same cluster and connected to the same subnet.

If the network configuration appears to be correct, verify whether Internet Group Management Protocol (IGMP) snooping is enabled on your network switch. If it is enabled, an IGMP querier must also be enabled in the same network (broadcast domain). In multicast networks, the querier is normally enabled on the router leading to the Netmail Store cluster, which is usually the default gateway for the nodes.

About IGMP Queriers

Some routers are configured to act as IGMP queriers in an IPv4 network for multicast group memberships, but other routers are not unless configured appropriately. Since multicast routing is not configured by default on all routers, an IGMP querier may not exist on your network unless you have specifically configured it for this task.

About IGMP Snooping

A switch with IGMP snooping enabled listens for IGMP traffic between hosts and routers. The switch will only forward multicast traffic out to ports where it heard an IGMP join message within a configurable time - typically around 5 minutes.

When a Netmail Store node joins a cluster, it sends an initial unsolicited join request for its configured multicast group. At that point, all Netmail Store nodes are visible from the Admin Console of all other nodes. IGMP queriers periodically send another query to see whether any hosts are still interested in the multicast group; the Netmail Store node does not send another unsolicited join message on its own. If there is no querier for that multicast group in the network, the switch stops forwarding multicast traffic for that group out of a given switch port when its timer for that group runs out. After the timeout, all Netmail Store nodes appear unable to contact each other because no query prompted a subsequent join from each node.

The purpose of IGMP snooping is to alleviate unnecessary multicast traffic from hosts that are not interested in the traffic. It is best practice for Netmail Store nodes to exist in their own private VLAN to prevent other hosts from entering the broadcast domain. As a result, you can disable IGMP snooping from the Netmail Store nodes' VLAN. There is no benefit to having IGMP snooping configured in a VLAN that only includes Netmail Store products.

If disabling IGMP snooping from the VLAN is not an option, you can configure an IGMP querier for your cluster's multicast group(s).

See your router documentation for details. For more information on IGMP Snooping, see RFC 2236.
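If you have console access to a node, you can check its multicast group membership with standard Linux tooling. The sketch below only builds the command; the interface name is a placeholder for the cluster NIC:

```shell
# Sketch: list multicast group memberships on a node's cluster NIC
# (assumes a standard Linux userland; eth0 is a placeholder).
iface="eth0"
list_cmd="ip maddr show dev ${iface}"
echo "$list_cmd"
# The cluster's configured multicast group address should appear in the
# output; if it is missing, the node's join was never seen by the switch
# or has since timed out.
```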

4. You have read-only access to the Admin Console even though you are listed in security.administrators.

5. You cannot view the Admin Console.

Resolution: You added an operator (a read-only user) to security.operators but did not also add your administrator user name and password to security.operators. As a result, you cannot access the Admin Console as an administrator.

To resolve this issue, add all of your administrator users to the security.operators parameter in the node or cluster configuration file.

See “Managing Netmail Store Administrators and Users” for more information.

6. The network does not connect to a node configured with multiple network interface controller (NIC) ports.

Resolution: Ensure that the network cable is plugged into the correct NIC. Depending on the bus order and the order in which the kernel drivers are loaded, the network ports may not match their external labeling.

7. A node automatically reboots.

Resolution: If the node is plugged into a reliable power outlet and the hardware is functioning properly, this issue may indicate a software problem. The Netmail Store system includes a built-in failsafe that reboots the node if something goes wrong. Contact Support for guidance.

8. A node is unresponsive to network requests.

Resolution: Perform the following steps until the node responds to network requests:

1. Ensure that your client network settings are correct.

2. Ping the node.

3. Open the Admin Console on the node by entering its IP address in a browser window (for example, http://192.168.3.200:90).

4. Attach a keyboard to the failed node and press Ctrl-Alt-Delete to force a graceful shutdown.

5. Press the hardware reset button on the node or power cycle the node.
