Basic Architecture

Content Router lives within a Netmail Store cluster but is always visible to other clusters, and perhaps to the external network at large. Consequently, it is assumed that all communication between Content Router services occurs over a secure TCP connection.

The following diagram illustrates, at a high level, the flow of messages between two Netmail Store clusters, one acting as the Primary Cluster and the other as a Disaster Recovery Cluster. Dotted lines represent local cluster communication. Solid lines represent HTTP traffic over TCP between Content Router services and between storage nodes in the two independent clusters.

Alternatively, if the network configuration prevents direct communication between the storage nodes (such as when Content Router is installed on a CSN), communication can be routed through an SCSP Proxy:

Structure of a Content Router Node

A Content Router node is a server running Linux that executes one or both of two services:

  • Publisher: Processes all streams stored in a cluster, filters them based on stream metadata, and publishes UUIDs to remote Subscribers.

  • Subscriber: Retrieves UUID publications from remote Publishers. The Replicator, the most common subscriber, retrieves remote UUIDs and sends replication requests to local nodes; it must be installed on a server in the same subnet as its target storage cluster. A third-party application can also function as a subscriber by integrating with the Enumerator API, whether or not it is installed on a node with other Content Router services. For more information on the Enumerator API, see Enumerator API.

Content Router service configuration parameters are used to enable one or both services, depending on the intended network topology. For the simple example in the previous section, where a primary cluster replicates in one direction to a single DR cluster and all nodes in both clusters are mutually visible, the Content Router node in the primary cluster would likely be configured to run only the Publisher service and the Content Router node in the DR cluster would run just the Subscriber. A slightly more complex example would be a pair of mirrored clusters where all nodes in both clusters still have mutual visibility. Content Router servers for this topology would look like the diagram below. For clarity, the assumed direct connections between storage nodes in the two clusters as shown and discussed in the previous example have been omitted.

As in the previous proxy-enabled example, if the storage nodes cannot communicate directly in the configured network topology, the Publisher can be configured to send responses through an SCSP Proxy in a mirrored configuration as well:

The two optional Content Router services are designed and deployed as independent processes running on a server. We discuss each of these services in more detail in the following sections.

Publisher Service

The Publisher service collects a comprehensive list of all the UUIDs stored in the cluster (as well as those that have been deleted), filters those UUIDs, and publishes the resulting lists of UUIDs to one or more remote Subscribers. The Publisher consists of several subcomponents:

  • Simple HTTP server
  • Attached database
  • Filter Rules Engine

Filtering UUIDs for Publication

The Publisher traverses the list of UUIDs and evaluates one or more existence validations or simple filter expressions against the values of certain headers in the metadata. The administrator can configure a set of matching rules that determines the topology of the intercluster network. These rules are specified in an XML syntax, the full definition of which can be found below.

As an example, suppose we want to configure our local cluster to remotely replicate all high and medium priority streams to a primary disaster recovery site, while all other streams are replicated to a secondary DR site. The general XML rule structure for this might look like the following:

<rule-set>
  <publish>
    <select name="PrimaryDR"/>
    <select name="SecondaryDR"/>
  </publish>
</rule-set>

The example above is a good starting point, but on its own it does not perform the filtering this scenario requires. To select the PrimaryDR cluster as the destination for some of the locally stored streams, we want to find all streams whose content metadata contains a header called “CAStor-priority” whose value starts with either a “1”, a “2”, or one of the words “high” or “medium”. Note that the header name is not case sensitive, but the header value is compared case sensitively by a match expression. Here is a select rule that uses a filter with a matches() expression to accomplish this:

<select name="PrimaryDR">
  <filter header="CAStor-priority">
    matches('\s*[12].*|\s*[Hh]igh.*|\s*[Mm]edium.*')
  </filter>
</select>

A select clause specifies a pattern for a single set of data that the Subscriber process retrieves by name. The select clause can contain zero or more filter clauses; here there is just one. If there are multiple filter clauses, all of them must match a stream’s metadata before the stream is published. As in HTTP, the order of headers within the metadata is not significant. If the stream metadata contains multiple headers with the given header name, a match on any one of them is enough for the filter to succeed. If there are no filter clauses, the select matches every stream, as in the following:

<select name="SecondaryDR">
</select>
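The matching semantics described above can be sketched in Python. This is only an illustration, not Content Router's implementation: the function name `select_matches` and the data shapes are hypothetical, and anchoring the pattern at the start of the value with `re.match` is an assumption chosen to agree with the "starts with" wording of the PrimaryDR example.

```python
import re

# Pattern copied from the PrimaryDR example.
PRIORITY = r"\s*[12].*|\s*[Hh]igh.*|\s*[Mm]edium.*"

def select_matches(filters, headers):
    """Return True if every filter is satisfied by the stream's headers.

    filters: list of (header_name, regex) pairs (hypothetical shape).
    headers: list of (name, value) pairs; order is not significant.
    """
    for name, pattern in filters:
        # Header names compare case-insensitively.
        values = [v for n, v in headers if n.lower() == name.lower()]
        # Any one header with this name may satisfy the filter;
        # the value itself is matched case-sensitively.
        if not any(re.match(pattern, v) for v in values):
            return False
    # With no filters at all, the select matches every stream.
    return True

primary_dr = [("CAStor-priority", PRIORITY)]
secondary_dr = []  # empty select: matches everything
```

In this sketch, `select_matches(primary_dr, [("castor-priority", "high")])` is true because only the header name is case-insensitive, while a value of "HIGH" does not match.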

The root tag for a set of Content Router rules is called rule-set, which can contain one or more publish tags as shown above. The example rule set below will replicate all high and medium priority streams to the PrimaryDR cluster and all others to the SecondaryDR cluster. It will also send all streams whose Content-Disposition header does not contain a file name ending with “.tmp” to the Backup cluster.

<rule-set>
  <publish>
    <select name="PrimaryDR">
      <filter header="CAStor-priority">
        matches('\s*[12].*|\s*[Hh]igh.*|\s*[Mm]edium.*')
      </filter>
    </select>
    <select name="SecondaryDR">
    </select>
  </publish>
  <publish>
    <select name="Backup">
      <filter header="content-disposition">
        not matches('.*filename\s*\=.*\.tmp.*')
      </filter>
    </select>
  </publish>
</rule-set>

Notice that a rule-set can contain multiple publish clauses, and each publish clause can contain multiple select clauses. The Filter Rules Engine evaluates all content elements for each publish clause. In the example above, where there are two publish clauses, all content streams can be queued for remote replication once for each publish. In addition, when there are two select clauses in a given publish clause, the content metadata is evaluated against each select clause’s filter set. The select clauses are evaluated in order from top to bottom. When the rules engine finds a select whose filter clauses all evaluate to true, the content stream is placed in the appropriate queue (i.e., PrimaryDR or SecondaryDR) and awaits remote replication. The evaluation of that publish clause is complete, and the rules engine begins evaluation of the next publish clause. When all publish clauses have been evaluated for a given content stream, then the Filter Rules Engine begins evaluation for the next content stream.
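The evaluation order described above (every publish clause is evaluated, and within each one the first select whose filters all pass wins) can be sketched as follows. The data layout and the names `filter_ok` and `route` are hypothetical, `re.match` anchoring is an assumption, and so is the choice to treat a missing header as a failed filter, which the text does not specify.

```python
import re

PRIORITY = r"\s*[12].*|\s*[Hh]igh.*|\s*[Mm]edium.*"
TMP_FILE = r".*filename\s*\=.*\.tmp.*"

# rule-set -> list of publish clauses -> ordered list of (select name, filters).
# Each filter is (header name, regex, negate); negate models "not matches".
RULE_SET = [
    [("PrimaryDR", [("CAStor-priority", PRIORITY, False)]),
     ("SecondaryDR", [])],
    [("Backup", [("content-disposition", TMP_FILE, True)])],
]

def filter_ok(flt, headers):
    name, pattern, negate = flt
    values = [v for n, v in headers if n.lower() == name.lower()]
    # Assumption: a filter on an absent header is not satisfied.
    return any((re.match(pattern, v) is not None) != negate for v in values)

def route(rule_set, headers):
    """Return the queue names a stream is placed in, one at most per publish."""
    queues = []
    for publish in rule_set:
        for select_name, filters in publish:  # top to bottom
            if all(filter_ok(f, headers) for f in filters):
                queues.append(select_name)
                break  # this publish clause is done; move to the next one
    return queues
```

For a high-priority stream named report.pdf, this sketch yields `["PrimaryDR", "Backup"]`; a low-priority scratch.tmp stream lands only in `["SecondaryDR"]`.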

Rules

The full syntax for the filter rules of a Publisher is presented below in simplified RELAX NG Compact Syntax.

start = RuleSet
RuleSet = element rule-set {
  Publish+
}
Publish = element publish {
  Select+
}
Select = element select {
  # zero or more clauses; a select with none matches every stream
  (Filter|Exists|NotExists)*,
  attribute name { text }
}
Filter = element filter {
  (HeaderAttr|LifepointAttr),
  # filter expression, using olderThan(), matches() etc.
  text
}
Exists = element exists {
  HeaderAttr
}
NotExists = element not-exists {
  HeaderAttr
}
HeaderAttr = attribute header { text }

The exists and not-exists elements simply test whether the header is present. A header with an empty value still satisfies an exists test.

Filter expressions are built using a small set of functions. The functions available to a filter are:

  • matches(regexstr) or contains(regexstr) - Matches any part of the header value to a given regular expression.
  • olderThan(dateSpec) - Matches if the header value is a date and that date is older than the date given, which may be either an absolute date or a relative (to execution time) date.
  • intValue(int) - Matches if the header value is an integer and executes the specified comparison against that integer (greater than, less than, equal to, etc.).
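As a rough illustration of how olderThan() behaves, the sketch below evaluates a header value against a relative or absolute date. It is not the product's code: the function names are hypothetical, the `Nd` (days) form is the only relative format handled, and parsing both header values and absolute dates as HTTP-style dates is an assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def _parse_date(text):
    """Parse an HTTP-style date; return None if the text is not a date."""
    try:
        when = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
    # Treat dates without a usable zone as UTC so comparisons are well defined.
    return when if when.tzinfo else when.replace(tzinfo=timezone.utc)

def older_than(header_value, date_spec, now):
    """True if header_value is a date earlier than the cutoff in date_spec."""
    when = _parse_date(header_value)
    if when is None:
        return False  # non-date header values never match
    if date_spec.endswith("d") and date_spec[:-1].isdigit():
        cutoff = now - timedelta(days=int(date_spec[:-1]))  # relative, e.g. '365d'
    else:
        cutoff = _parse_date(date_spec)  # absolute date (assumed format)
        if cutoff is None:
            return False
    return when < cutoff

def between_one_and_two_years(header_value, now):
    # Python's own and/not stand in here for the rule language's operators.
    return older_than(header_value, "365d", now) and not older_than(header_value, "730d", now)
```

Passing `now` explicitly keeps the relative cutoff tied to a single execution time, which makes the sketch easy to check.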

Typical boolean and grouping operators are also available, so more sophisticated filters are easy to construct. For example, (olderThan('365d') and not olderThan('730d')) or matches('^Mon\s.*') expresses “between one and two years old or on Mondays”.

Replicator Service

A Content Router node’s Replicator service serves as a subscriber to a remote cluster’s Publisher service. Its purpose is to receive UUIDs from the Publisher service and schedule them for replication (or deletion) in its local cluster. The Replicator can be configured to periodically poll one or more remote Publishers, using an HTTP GET request, to obtain a list of UUIDs to be replicated or deleted. On receiving a list, the Replicator immediately writes the UUIDs to a queue on its local disk. It then attempts to process each write or delete, removing each UUID from the queue once its operation completes.
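The Replicator's poll, queue, and process cycle can be sketched as below. This is a simplified model under stated assumptions: `DiskQueue`, `poll_once`, the JSON queue file, and the `[action, uuid]` entry shape are all hypothetical, and the HTTP GET to a Publisher is replaced by an injected `fetch` callable so the example runs without a network.

```python
import json
import os

class DiskQueue:
    """A tiny on-disk queue stored as a JSON list (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def save(self, items):
        # Write to a temporary file and rename, so a crash leaves a valid queue.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(items, f)
        os.replace(tmp, self.path)

def poll_once(fetch, queue, apply):
    """Run one polling cycle and return the entries still pending.

    fetch: stands in for the HTTP GET; returns (action, uuid) pairs.
    apply: performs one replicate or delete locally; returns True on success.
    """
    # Persist newly published UUIDs before attempting any work.
    pending = queue.load() + [list(entry) for entry in fetch()]
    queue.save(pending)
    for entry in list(pending):
        if apply(entry):
            # Remove an entry from the queue only once it has completed.
            pending.remove(entry)
            queue.save(pending)
    return pending
```

Entries whose operation fails stay in the queue file and are retried on the next call to `poll_once`.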