ACID Multi-Master Replication in M-Vault Directory
This whitepaper looks at the approach taken to supporting multi-master replication in the Isode M-Vault directory server product. The paper looks at how ACID (Atomicity, Consistency, Isolation, Durability) database requirements are addressed by the approach taken, and sets the approach in the context of other techniques used in distributed directories.
The ACID Database Requirement
ACID (Atomicity, Consistency, Isolation, Durability) is the classic measure of database transaction reliability, comprising four elements:
- Atomicity: Atomicity requires that each transaction be “all or nothing” and that partial changes never happen.
- Consistency: Consistency means that any transaction will bring the database from one valid state to another.
- Isolation: Isolation means that where two transactions occur in parallel, the resulting state will be the same as if the two transactions had occurred sequentially (possibly in a specific order).
- Durability: Durability means that once a transaction has been committed, it will remain so.
These requirements are considered in this paper.
Directory Technology
Directory services described in this paper are a specialized database technology introduced with the X.500 standards and widely adopted using the LDAP (Lightweight Directory Access Protocol).
Directory Hierarchy & Distribution
LDAP is based on a hierarchical data model, with directory entries having a rich attribute-based data structure that supports storage of a wide range of information using a hierarchical object class model for data in each entry. Each entry has a simple “relative distinguished name” that enables directory entries to be arranged in a hierarchy, with each entry having a global “distinguished name” that names the entry relative to the root of the hierarchy.
Apart from the hierarchical naming, there is no dependency between entries. This means that changes to the directory do not have complex requirements to deal with relationships between entries. The directory hierarchy (Directory Information Tree (DIT)) can be used to distribute data across multiple servers, so that different sub-trees of the DIT will be stored in different directory servers.
This use of the hierarchy is central to how LDAP and X.500 can be used to provide large scale directory services, as organizations can run separate directory servers which collaborate to provide a single directory service. Use of different servers to handle different parts of the DIT is called “distribution”. Because of the simple hierarchy of the DIT, when a set of (non-replicated) directory servers have been configured to support a given DIT, ACID properties can be considered separately for each server. The hierarchy means that changes impact single servers only and distribution of servers does not lead to requirements to co-ordinate changes.
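As a minimal sketch of distribution, the following Python routes a distinguished name to the server holding the deepest sub-tree that contains it. The server names, the sub-tree assignments and the simplified comma-split name parsing are illustrative assumptions, not M-Vault configuration or behavior.

```python
# Hypothetical assignment of DIT sub-trees to directory servers.
SUBTREES = {
    "o=Example": "server-a",
    "ou=Sales,o=Example": "server-b",
    "ou=Engineering,o=Example": "server-c",
}

def rdns(dn):
    """Split a DN into RDNs, root first. Simplified: ignores escaped commas."""
    return [part.strip().lower() for part in reversed(dn.split(","))]

def server_for(dn):
    """Return the server holding the deepest sub-tree that contains dn."""
    target = rdns(dn)
    best, best_depth = None, -1
    for base, server in SUBTREES.items():
        prefix = rdns(base)
        if target[:len(prefix)] == prefix and len(prefix) > best_depth:
            best, best_depth = server, len(prefix)
    return best
```

Because each name resolves to exactly one server in this sketch, an update to an entry touches only that server, which is why ACID properties can be considered per server in the non-replicated case.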
X.500 & Single Master
In order to achieve resilient operation of a directory service, it is generally desirable to replicate data. X.500 defines a directory replication approach using the DISP (Directory Information Shadowing Protocol) described on the M-Vault Directory Server: Replication & Data Distribution product page. DISP provides a powerful and flexible approach for directory replication, and it is the only open standard for directory replication. DISP is an important element of the Isode directory solution.
DISP uses a single master model, where changes are applied to a single directory server (the master) and then copied (shadowed) to other directory servers holding the same data. This is a straightforward model when considering ACID properties. The master directory server will handle update requests from directory clients, and will need to do this in a manner that ensures ACID behavior (e.g., when handling two update requests that are attempting to modify an entry in inconsistent ways). M-Vault does this by appropriate information locking, to ensure that ACID properties are maintained.
DISP transfers changes from master to shadow, and will ensure that ACID properties are maintained in the shadow. As each of the servers follows ACID properties, in one sense the whole system does.
However, consider the following scenario:
- Directory client modifies a user password (in the master directory).
- Directory client then reads back the password (to verify the change) but connects to a shadow directory server.
- If this read happens before shadowing the update is completed, the client gets back the wrong answer. The data is not Consistent across the whole directory service.
It can be seen that ACID properties do not always apply to the directory service as a whole. This can be a significant operational concern, although in many deployments and situations the replication delays are quite acceptable. When sensitive data is changed, it may be important that it is changed in all locations without delay. Later, we will discuss approaches that can be used to address this.
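The scenario can be reproduced with a small simulation. This is a sketch only: the Master and Shadow classes and the shadow_updates function are hypothetical stand-ins for a master server, a shadow server and a DISP transfer, and do not model the real protocol.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.pending = []  # changes not yet shadowed

    def modify(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

class Shadow:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

def shadow_updates(master, shadow):
    """Copy pending changes, in order, from master to shadow."""
    for key, value in master.pending:
        shadow.data[key] = value
    master.pending.clear()

master, shadow = Master(), Shadow()
master.modify("userPassword", "old")
shadow_updates(master, shadow)

master.modify("userPassword", "new")
stale = shadow.read("userPassword")   # read before shadowing completes
shadow_updates(master, shadow)
fresh = shadow.read("userPassword")   # read after shadowing completes
```

Here stale holds the previous password and fresh holds the new one. Each server is internally consistent at every step, yet the service as a whole returns different answers depending on which server the client reaches.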
Single Master vs Multi-Master
An alternative to the Single Master approach where updates are applied to a specific directory server is Multi-Master, which allows changes to be applied to one of several directory servers. Both architectures have merits, and deployments should consider both.
Isode has developed a sophisticated approach for single master deployments with support for disaster recovery, described in the whitepaper [M-Vault Failover and Disaster Recovery]. That paper also sets out clearly the benefits of a single master approach.
There are a number of scenarios where multi-master is preferable, including:
- Where it is important to always allow updates and where failover in the event of directory server failure should be automatic. (Single master failover needs operator intervention.)
- To support a clustered service using a peer-to-peer architecture, where each node may make directory updates. A good example of this is a clustered Isode M-Link XMPP service described on the M-Link Reliability page. In this architecture, it makes sense for updates to be applied to the local M-Vault directory server.
- Where it is important to be able to make updates to the same part of the directory from different locations, when there may be network partitioning between the organizations.
Isode has added multi-master support to M-Vault, as we see this as an important option.
Multi-master by Eventual Convergence
Key research on distributed directories was done at Xerox in the 1980s with the Clearinghouse and Grapevine. These systems had a multi-master model where changes are made to a local directory server and are subsequently reconciled between the servers. Most multi-master directories, including Microsoft Active Directory, work in this way. The term “Eventual Convergence” is used to describe this approach.
Eventual convergence is a practical approach that works well in many situations, as changes to directories tend to be isolated and not to conflict.
Eventual convergence is not ACID. The key downside is that conflicting changes can be made at different servers (e.g., a single valued attribute can be modified in different ways by two different clients connecting to two different directory servers). This means that the directory is inconsistent for a while. Then change reconciliation will need to resolve the conflict, which it will do by picking one of the changes. This will resolve the inconsistency, but in a way that means that the change made by one of the clients is reversed (so that a client thinks it made a change, which ends up not being made, strongly violating the Durability part of ACID).
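The lost update can be illustrated with a short sketch. The text above says only that reconciliation picks one of the changes. The "latest timestamp wins" rule used here, along with the Change type, the server names and the attribute values, is an assumption chosen for illustration, not a description of any particular product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    server: str
    timestamp: int
    value: str

def reconcile(a, b):
    """Pick one of two conflicting changes: latest timestamp wins,
    with the server name as a tie-breaker (an assumed policy)."""
    return max(a, b, key=lambda c: (c.timestamp, c.server))

# Two clients set the same single-valued attribute on different servers.
# Both clients are told that their update succeeded.
change_1 = Change(server="server-a", timestamp=100, value="+44 20 0000 0001")
change_2 = Change(server="server-b", timestamp=101, value="+44 20 0000 0002")

winner = reconcile(change_1, change_2)
loser = change_1 if winner is change_2 else change_2
```

After reconciliation every server holds winner.value. The client that submitted the losing change received a success response for an update that no longer exists anywhere in the directory.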
For some uses of directory, this non-ACID behavior is highly undesirable.
Achieving ACID Multi-Master in Normal Operation
In adding multi-master capability to M-Vault, we looked at how we could provide multi-master with ACID characteristics. Network and computer performance and resilience have massively increased since many of the original multi-master directories were developed. This allows us to adopt an approach that would not have been viable in the 1980s.
In a multi-master directory setup, every server participating as a master is directly connected to each of the other master servers. The stages in the update process are shown below.
- A directory client requests an update.
- The directory server obtains appropriate locks. For modifying an entry, an update lock on the entry will be obtained. Other updates such as renames will need different locks. First a local lock will be obtained and then a lock on every other master.
- The change is made on each of the remote masters and the local master.
- The locks are released.
- The client is told that the update has been made.
Important aspects of this for ACID properties are:
- The servers are locked in a way that ensures that the same change is made on each server, so the problems arising with Eventual Convergence are avoided. There is no need for change resolution.
- All servers are updated before the client is told of update success. So once the client has been told that a change is made, all subsequent directory reads will reflect the change.
Note that this approach does not mean that each directory will process changes in the same order. The important point to note is that change processing will only happen in a way that does not impact the eventual state, as entry locking prevents conflicting changes from taking place concurrently.
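The update stages described above can be sketched as follows. This is a single-process simulation under stated assumptions: the MasterServer class, its lock and unlock methods and the update function are hypothetical names, a lock is modelled as set membership, and failure handling, retries and lock types other than a per-entry update lock are left out.

```python
class MasterServer:
    def __init__(self, name):
        self.name = name
        self.entries = {}
        self.locked = set()   # DNs currently locked for update

    def lock(self, dn):
        if dn in self.locked:
            return False
        self.locked.add(dn)
        return True

    def unlock(self, dn):
        self.locked.discard(dn)

def update(local, others, dn, attrs):
    """Apply attrs to dn on every master, or on none of them."""
    held = []
    for server in [local] + others:   # local lock first, then every other master
        if not server.lock(dn):
            for s in held:            # could not lock everywhere: back out
                s.unlock(dn)
            return False
        held.append(server)
    for server in held:               # the same change on every master
        server.entries.setdefault(dn, {}).update(attrs)
    for server in held:
        server.unlock(dn)
    return True                       # only now is the client told of success

a, b, c = MasterServer("a"), MasterServer("b"), MasterServer("c")
dn = "cn=Alice,o=Example"
ok = update(a, [b, c], dn, {"mail": "alice@example.com"})

b.lock(dn)   # simulate a conflicting update already in progress via server b
blocked = update(a, [b, c], dn, {"mail": "other@example.com"})
```

In this sketch the first update reaches all three masters before it returns, and the second is refused outright because the entry is locked on one master, so no server ever holds a value that later needs to be reconciled away.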
This approach leads to ACID properties in normal operations. The next sections of this paper consider what happens when components fail.
Server Restart
One situation that will sometimes arise is server restart, which may be planned or the result of a hardware or software problem. On restart, a key characteristic of M-Vault is that its first action is to synchronize with the other master servers and regain consistency (i.e., to capture changes made on other servers while the local server was not operating). It does this before it answers any client queries or accepts modification requests. This ensures that a restarted multi-master server only provides answers with “current” data, which is needed to maintain ACID properties for the whole directory service.
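A minimal sketch of this restart rule follows. The RestartingMaster class and its methods are hypothetical, and synchronization is simplified to copying a peer's full state. How M-Vault actually transfers missed changes is not described here.

```python
class RestartingMaster:
    def __init__(self, entries):
        self.entries = dict(entries)   # state as it was when the server stopped
        self.ready = False             # no client operations until synchronized

    def synchronize(self, peer_entries):
        """Capture changes made on a peer while this server was down."""
        self.entries = dict(peer_entries)
        self.ready = True

    def read(self, dn):
        if not self.ready:
            raise RuntimeError("server is synchronizing; not yet serving")
        return self.entries.get(dn)
```

A read attempted before synchronize is called raises an error rather than returning the pre-restart value, so a client never sees data that the other masters have already superseded.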
Network Partition
The hardest problem for a multi-master system is dealing with network partition. This is where there is a network failure and a pair of servers cannot communicate. In general, a server cannot distinguish between network partition and failure of the remote server.
Enforcing Quorum and Retaining ACID
M-Vault provides an option to always retain ACID properties, by requiring that a quorum of directory servers participates in any update. A quorum is a strict majority (e.g., 2 out of 3, 3 out of 4, or 3 out of 5). Consider a multi-master directory with three servers, where a network partition splits the servers into a group of one and a group of two (or where one server fails). The group of two is a quorum, so updates may be made. The server on its own is less than a quorum, so it does not accept updates while it is separated.
This model allows updates in the event of server failure, provided that there are at least three servers. Because changes will only be made by a “majority” group of servers, inconsistent changes will not arise. This avoids the need for reconciliation and ACID properties are preserved.
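The quorum arithmetic can be written down directly. The function names below are illustrative and are not part of any M-Vault interface.

```python
def quorum_size(total_servers):
    """Smallest strict majority of total_servers."""
    return total_servers // 2 + 1

def may_update(reachable_servers, total_servers):
    """reachable_servers counts the local server plus the masters it can reach."""
    return reachable_servers >= quorum_size(total_servers)
```

For three servers the quorum is two, so may_update(2, 3) holds and may_update(1, 3) does not. With an even number of servers an even split leaves neither side with a quorum: may_update(2, 4) is false for both halves. At most one group can ever hold a strict majority, which is what rules out inconsistent changes.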
ACID vs Update Availability
The downside of the quorum approach is that network failures will constrain where updates can be applied. If a server is network partitioned, it will refuse updates.
For some deployments, the ability to make updates is of higher priority than ensuring ACID properties. We plan to support this in a future release of M-Vault by enabling M-Vault to move to the Eventual Convergence model in the event of network partition. This will allow servers to accept changes, and conflict resolution will reconcile changes when the network partition ends. In this situation, M-Vault will clearly log any cases where it resolves conflicts by rejecting changes made by a client. This will enable operator review of the conflict resolution to ensure best behavior.
In normal operation, M-Vault will provide ACID characteristics. It will be a deployment choice to resolve the trade-off between ACID and update availability.
Operation with Secondary Shadowing
The paper has so far considered a scenario where all directory servers participate in multi-master. In some directories, replication to a large number of servers is important. If only a subset of these servers needs to make changes, it is sensible to allow shadowing of directory data to read-only shadow servers. M-Vault allows this, using the X.500 DISP protocol, so that multi-master replication can be used in conjunction with open standard shadow replication.
Conclusions
This whitepaper has described the M-Vault approach to providing a multi-master directory. This enables ACID characteristics in normal operation, and allows a choice of ACID characteristics or high update availability in the event of network partition. It also supports secondary shadowing for highly replicated directory services.