Netmail Archive 6.x
What role does the index server play?
Netmail uses a well-known technology for the indexing role: Solr. Solr is built around the Lucene engine and is deployed on the CentOS platform. It plays a crucial role in the entire Archive solution:
- Permits jobs to be selective about which items to act upon (using a policy)
- Enables viewing the contents of the archives
- Performs searches for eDiscovery
A properly specified server is essential for delivering the above functionality without undue latency or a degraded user experience.
How are the indexes different from my archives?
There is occasionally a misconception that the indexes are, in fact, the archives. This is understandable given the important role the indexes play in the overall solution. However, the indexes are nothing more than a collection of pointers to your actual data, which is stored elsewhere (admin console -> Archiving -> Storage tab). The distinction matters: lost indexes can be rebuilt from the archived data, but lost archive data cannot be rebuilt from the indexes.
How fast can I index documents?
The speed of indexing depends directly on the specifications of the environment:
- Number of index & archive servers
- Performance of index & archive servers (particularly CPU/RAM on archive servers, and storage I/O on index servers)
- Whether attachments are included, and what their characteristics are
- Network bandwidth
It is therefore very difficult to quote a number that applies to all cases. However, it is not uncommon to see rates of 5M+ documents/day when indexing runs continuously in a healthy environment.
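To put the 5M documents/day figure in perspective, the following is a rough back-of-the-envelope calculation. The 120M-document backlog is a hypothetical number chosen for illustration, and real rates will vary with the factors listed above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def docs_per_second(docs_per_day: float) -> float:
    """Convert a daily indexing rate into an average per-second rate."""
    return docs_per_day / SECONDS_PER_DAY


def days_to_index(total_docs: int, docs_per_day: float) -> float:
    """Estimate how many days of continuous indexing a backlog needs."""
    return total_docs / docs_per_day


rate = docs_per_second(5_000_000)
backlog_days = days_to_index(120_000_000, 5_000_000)  # hypothetical backlog
print(f"{rate:.1f} documents/second on average")
print(f"{backlog_days:.0f} days for a 120M-document backlog")
```

At 5M documents/day the average works out to roughly 58 documents per second, so a backlog of this size would need a little over three weeks of continuous indexing.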
How are documents distributed?
Indexes are organized as a collection of shards which are distributed evenly across all index servers. All data pertaining to a mailbox is mapped to a single shard, whereas documents crawled from file stores are spread over all the shards. Shards can also be configured to be replicas of each other, thereby increasing resiliency through redundancy.
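The distribution rule above can be illustrated with a minimal sketch. It assumes a simple hash-based routing scheme; the algorithm Netmail actually uses to pick a shard is not described here, and the shard count, function names, and sample identifiers are all hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def _stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike the built-in hash())."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_mail_item(mailbox_id: str) -> int:
    """Route by mailbox ID, so every item in a mailbox lands on one shard."""
    return _stable_hash(mailbox_id) % NUM_SHARDS


def shard_for_file(document_path: str) -> int:
    """Route by document path, so file-store documents spread over the shards."""
    return _stable_hash(document_path) % NUM_SHARDS


mail_shards = {shard_for_mail_item("jdoe@example.com") for _ in range(100)}
file_shards = {shard_for_file(f"/share/docs/report_{i}.pdf") for i in range(1000)}
print(f"shards used by one mailbox: {len(mail_shards)}")
print(f"shards used by the file store: {len(file_shards)}")
```

Keying mail on the mailbox keeps a mailbox's data together on one shard, while keying files on the individual document spreads the load.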
How does the indexing process work?
Netmail 6.x introduces a new architectural component called the NIPE. It resides on the Archive servers and is also referred to as the Netmail Indexing service in the Microsoft Services console. This component assumes all indexing-related duties on the Archive server and handles all interactions with the Index servers.
When an item is indexed, it is first pre-processed by the NIPE before being sent to the index servers. The NIPE takes care of:
- fetching & opening attachments or embedded email messages
- extracting the text
- analyzing & tokenizing the content
- applying regex for tagging
- selecting the appropriate shard
- pushing the data
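Two of the steps above, tokenizing and regex tagging, can be sketched as follows. This is only an illustration: the analyzers and tag patterns that the NIPE really uses are not documented here, so the rules and function names below are hypothetical.

```python
import re

# Hypothetical tagging rules; the patterns Netmail actually uses are not shown here.
TAG_RULES = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "invoice": re.compile(r"\bINV-\d+\b"),
}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag(text: str) -> list[str]:
    """Return the name of every rule whose pattern matches the text."""
    return sorted(name for name, pattern in TAG_RULES.items() if pattern.search(text))


def preprocess(text: str) -> dict:
    """Build the record that would be pushed to the index servers."""
    return {"tokens": tokenize(text), "tags": tag(text)}


record = preprocess("Please pay INV-2041 today.")
print(record)
```

Doing this work before the data leaves the Archive server means the index servers receive content that is already analyzed and tagged.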
Once on the index servers, the data is collected in a buffer and processed in batches. Every 5 minutes a batch is processed and its indexes become available in memory (a.k.a. a soft commit). At this point the data is visible in Netmail Search. Every 10 minutes the in-memory indexes are flushed to disk (a.k.a. a hard commit).
Should the index server become unresponsive for any reason, it can be restarted safely without losing data.
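The 5-minute and 10-minute commit intervals described above determine how long an item waits before it is searchable and before it is on disk. The sketch below assumes, for simplicity, that commits fire on a fixed schedule counted from minute zero; the real timers may be anchored differently.

```python
import math

SOFT_COMMIT_MINUTES = 5    # batch becomes searchable (soft commit)
HARD_COMMIT_MINUTES = 10   # in-memory indexes are flushed to disk (hard commit)


def next_tick(arrival_minute: float, interval: int) -> int:
    """Return the first scheduled tick strictly after the arrival time."""
    return (math.floor(arrival_minute / interval) + 1) * interval


def commit_times(arrival_minute: float) -> tuple[int, int]:
    """Return (minute the item is searchable, minute it is on disk)."""
    return (
        next_tick(arrival_minute, SOFT_COMMIT_MINUTES),
        next_tick(arrival_minute, HARD_COMMIT_MINUTES),
    )


for arrival in (0.5, 4.9, 7.0):
    searchable, on_disk = commit_times(arrival)
    print(f"arrived at {arrival} min: searchable at {searchable}, on disk at {on_disk}")
```

Under this assumption an item is never more than 5 minutes away from appearing in Netmail Search, and never more than 10 minutes away from being written to disk.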
Why are only 2 accounts indexing at a time?
Nearly all Netmail jobs run as threads in the job monitor and, when required, feed their documents to NIPE for indexing. Typically, as with an archive job, 10 threads run per node and each submits its items to NIPE during the course of its processing. The only exception is an index job, which hands its work over entirely to NIPE. NIPE takes the path to the account and processes all the items in it without the intervention of the thread (hence the status "Waiting for index to complete" on the threads in the monitor: they are waiting, not actively working). However, NIPE can only perform this service for 2 index threads at a time, so the remaining threads must wait their turn.
This design was settled upon after multiple benchmarks, and the number of threads that NIPE takes on at once cannot be changed. In the majority of cases it makes the best use of a server's resources and completes the indexing in optimal time.
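The effect of the two-at-a-time limit can be modeled with a counting semaphore. This is a simplified illustration only: how NIPE enforces the limit internally is not described, and the sleep call merely stands in for real indexing work.

```python
import threading
import time

MAX_CONCURRENT_INDEX_JOBS = 2   # fixed limit described above
JOB_THREADS = 10                # typical thread count per node

nipe_slots = threading.Semaphore(MAX_CONCURRENT_INDEX_JOBS)
lock = threading.Lock()
active = 0
peak = 0


def index_account(account: str) -> None:
    """Wait for a free slot, then simulate indexing one account."""
    global active, peak
    with nipe_slots:                 # at most two threads get past this point at once
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)             # stand-in for the real indexing work
        with lock:
            active -= 1


threads = [
    threading.Thread(target=index_account, args=(f"account-{i}",))
    for i in range(JOB_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"peak concurrent index operations: {peak}")
```

However many job threads are started, the peak never exceeds the limit; the other threads simply queue until a slot frees up.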
How is the solution scalable?
Indexing capacity can be added in two ways:
- Scaling up: For moderate increases, the resources of the existing servers can be augmented. Measure consumption of RAM, CPU, and storage and supplement where necessary.
- Scaling out: For larger increases, additional servers can be added to the indexing cluster. In this scenario, shards from the existing collection must be relocated to the new servers to distribute the load.
In either case, the indexing cluster must be taken offline for the upgrade to be performed.
Is the indexing cluster resilient?
As mentioned above, shards can be configured to be replicas of one another. Having replicas in the collection not only increases resiliency but also improves query performance (i.e. Netmail Search responsiveness). As part of the original deployment, a Setup Wizard is run to create the shards/collection. At that point the number of replicas may be specified, and the Wizard distributes the shards intelligently to maximize availability in the event of a failure.
The alternative to live replicas is to perform a traditional backup of the indexing data. The Netmail web console facilitates this by enabling a scheduled dump of all the shards to a specified location.
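The idea behind distributing replicas for availability can be sketched as below. This is not the Setup Wizard's actual algorithm, which is not documented here; it is a simple round-robin placement, with hypothetical server names, that keeps every copy of a shard on a different server.

```python
def place_replicas(num_shards: int, copies_per_shard: int, servers: list[str]) -> dict:
    """Assign every copy of a shard to a different server, round-robin."""
    if copies_per_shard > len(servers):
        raise ValueError("need at least as many servers as copies of each shard")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = [
            servers[(shard + copy) % len(servers)]
            for copy in range(copies_per_shard)
        ]
    return placement


layout = place_replicas(
    num_shards=4,
    copies_per_shard=2,
    servers=["index1", "index2", "index3"],  # hypothetical server names
)
for shard, hosts in layout.items():
    print(shard, hosts)
```

Because no two copies of a shard share a server in this layout, losing any single server still leaves one copy of every shard available.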