M-Link Archive and Search
This whitepaper describes the new Archive capability in M-Link, and how it enables searching of archive data by end users and operators. It describes operator and management capabilities, including archiving, search, storing statistics history and how very long term archiving can be achieved using PDF/A storage.
The diagram below illustrates how search and archive have been implemented for M-Link. A new M-Link Archive process has been added, which will usually run on the same machine as the core M-Link server. In a clustered configuration, an M-Link Archive process will run on each cluster node, and there is an archive clustering protocol that connects the M-Link Archive processes, so that each archive process holds information from each cluster node.
End user access to the archive is provided using the Message Archive Management (MAM) protocol specified in XEP-0313. MAM operates over the standard XMPP Client to Server protocol. XMPP clients that implement MAM can directly access M-Link to search and access XMPP archive information. Isode provides a Web Application that can run in any modern browser. This M-Link MAM Web Application connects to M-Link using BOSH, to enable end user access to XMPP archives.
M-Link Console, the M-Link Management GUI, connects directly to an M-Link Archive process to provide operator access to the archives for other search and archive management capabilities.
Types of Archiving
There are a number of distinct reasons for server side archiving of XMPP traffic. Most users will only require a subset of these capabilities. These requirements are grouped by four time periods. The requirements for the fourth period (very long term) are distinct from the first three and will need to be addressed by a special mechanism.
Short Term
This is typically useful in a timeframe of minutes to a few days. There are a number of situations where short term archive is of particular use:
- XMPP clients without local message history. Many XMPP clients hold local history, but some (in particular, web clients) do not. Server side archive is an ideal way to provide message history for such a client.
- Switching clients. If a user switches client (say from mobile to desktop), the new client will have no message history, and server side archives are an ideal way to find out what has gone on.
- Multi-User Chat (MUC) history exhaustion. MUC rooms hold a configurable amount of history, so that those joining a MUC room can see what has gone on. The amount of history held is a tradeoff between providing a useful amount of context to new joiners and avoiding excess history. There will be situations where a new joiner wishes to find messages older than those held in the MUC history. A server side archive enables this.
- Operator access to end user traffic.
Medium Term
In a longer timeframe, which may be months or even years, client and MUC history will not be present (or at least cannot be relied on), although this data may still be of operational interest to users. Server side archives will allow users and operators to access information on old traffic.
Long Term
Long term archive provides access to archive data after it is no longer of operational interest. Where XMPP has been used for critical communication, it will often be important to have flexible access to archives, with archive access capabilities similar to medium term archive. This will typically be a timeframe of years.
Very Long Term
Long term archive information may have historical or legal interest for tens of years and may be considered as operational record keeping. For example, a military mission might archive all information for a legally required period after the end of the mission. It is unrealistic to expect or require that special purpose software be maintained over this timeframe to read archive information. For this reason, very long term archive requires storage in a standard format, where there is a high level of confidence that it can still be read with the tools available at that time.
Archiving Solution Goals
The following goals are identified for the M-Link archiving solution:
- To support all four types of archiving identified, noting that in practice the requirements for the first three overlap.
- To provide UIs to access the archives with appropriate interfaces for the various use types.
- To deal effectively with long term archiving, when the data volumes are too high for storage on operational servers.
- To have as much commonality as possible across the overall solution.
- To support access from:
  - Users using standard XMPP authentication and getting access control equivalent to the original traffic (i.e., users should only be able to see traffic that they were party to or could have been party to).
  - Operators with privileged access to all traffic.
- To use open standards where possible.
- To enable an integrated experience with standard XMPP clients and tools.
- To provide an integrated service, so that archiving is simply an M-Link component and does not require any external services or add-ons.
Message Archive Management (MAM)
M-Link provides open standard client access to XMPP archives using “Message Archive Management” (MAM), specified in XEP-0313. MAM is for use by an XMPP client connected to an XMPP server, authenticated and authorized in the standard way. It is designed to give end users access to archive information, and is an ideal component to help achieve the end user archive access goals. MAM provides a number of capabilities, including:
- Access to the key XMPP message types:
  - 1:1 User Chat.
  - MUC (Multi-User Chat).
  - PubSub (Publish/Subscribe).
- Selection of messages over a date/time range.
- Filtering by user and free form text.
- Paged results to deal with large search results.
The core MAM service is designed to provide listing of archives based on date. M-Link server and clients make use of searching with MAM using XEP-0004 forms, which not all XMPP servers provide.
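As an illustration of the capabilities listed above, the following minimal sketch builds a MAM query that combines a date/time range, a filter by user and paged results. The element names, the "start", "end" and "with" form fields and the paging element follow XEP-0313, XEP-0004 and XEP-0059. The namespace version, the helper name build_mam_query and the example JID are assumptions made for illustration. This paper does not give the form field M-Link uses for free form text search, so it is left out.

```python
import xml.etree.ElementTree as ET

MAM_NS = "urn:xmpp:mam:2"  # assumption: MAM namespace version in use
FORM_NS = "jabber:x:data"  # XEP-0004 data forms
RSM_NS = "http://jabber.org/protocol/rsm"  # XEP-0059 result paging


def build_mam_query(query_id, start=None, end=None, with_jid=None, page_size=50):
    """Build an <iq/> stanza carrying a MAM query with a XEP-0004 filter form."""
    iq = ET.Element("iq", {"type": "set", "id": query_id})
    query = ET.SubElement(iq, f"{{{MAM_NS}}}query", {"queryid": query_id})
    form = ET.SubElement(query, f"{{{FORM_NS}}}x", {"type": "submit"})

    def add_field(var, value, field_type=None):
        attrs = {"var": var}
        if field_type:
            attrs["type"] = field_type
        field = ET.SubElement(form, f"{{{FORM_NS}}}field", attrs)
        ET.SubElement(field, f"{{{FORM_NS}}}value").text = value

    # The hidden FORM_TYPE field identifies the form as a MAM filter.
    add_field("FORM_TYPE", MAM_NS, "hidden")
    for var, value in (("start", start), ("end", end), ("with", with_jid)):
        if value is not None:
            add_field(var, value)

    # Limit the size of each page of results.
    paging = ET.SubElement(query, f"{{{RSM_NS}}}set")
    ET.SubElement(paging, f"{{{RSM_NS}}}max").text = str(page_size)
    return iq


stanza = build_mam_query(
    "q1",
    start="2024-01-01T00:00:00Z",
    end="2024-01-31T23:59:59Z",
    with_jid="alice@example.com",  # hypothetical user
    page_size=25,
)
stanza_xml = ET.tostring(stanza, encoding="unicode")
```

A client would send the resulting stanza over its authenticated XMPP connection and request further pages using the paging information returned with each result set.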
MAM in XMPP Clients
End user access to server archives is best achieved through end user XMPP client support of MAM. At the time of writing this whitepaper, support for MAM in end user XMPP clients is limited. Isode plans to add MAM support to the Swift XMPP Client.
M-Link MAM Web Application
Isode provides a Web interface to MAM, which can be used by end users to access server archives. This is generally useful, but will be particularly helpful in environments where some or all deployed XMPP end user clients do not support MAM.
The M-Link MAM web application runs “in the browser” and connects to M-Link using BOSH. It allows the user to select MUCs or users, and then search over a selected time range, as shown in the screenshot above.
M-Link Archive Server
M-Link provides an archive server process that runs separately from the M-Link core. In a standard configuration, both M-Link core and archive server processes will run as a pair on the same machine. Communication between the processes uses a high performance protocol.
M-Link Archive Server is built on an Isode developed database optimized for XMPP archives called “wabac” (pronounced “way back”). Wabac is designed to be database technology independent, and is built as a data store with an API that could be implemented with different technologies. This will enable Isode to take advantage of new database libraries in the future, although there are no current plans to do so. The first version of wabac uses embedded SQLite database technology. SQLite is transactional and enables excellent performance and scaling for wabac. Online backup is provided by a tool associated with M-Link (Archive). A wabac database is built on a single SQLite database.
The above diagram shows how the basic archiving process works. M-Link (Core) selects messages to be archived, covering all types of traffic, and sends them to M-Link (Archive). This ensures full archiving. The archiving mechanism is asynchronous, and so M-Link (Core) will not block on M-Link (Archive) and database writes, so that XMPP traffic can be switched quickly even when there are spikes in traffic load. M-Link (Core) writes queued messages to disk until they have been transferred to M-Link (Archive) and acknowledged, so that archiving is resilient to outages and restarts of both M-Link server processes.
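The queue-until-acknowledged behaviour described above can be sketched as a small disk-backed spool. This is an illustrative sketch only, not the M-Link implementation: the class name ArchiveSpool, the one-file-per-message layout and the JSON encoding are all assumptions. It shows the key property, which is that a message stays on disk across a restart until the archiver has acknowledged it.

```python
import json
import os
import tempfile


class ArchiveSpool:
    """Disk-backed queue: a message stays on disk until it is acknowledged."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        # Resume numbering after any entries left over from a previous run.
        self.next_seq = max(self._sequence_numbers(), default=0) + 1

    def _sequence_numbers(self):
        names = os.listdir(self.directory)
        return sorted(int(n.split(".")[0]) for n in names if n.endswith(".json"))

    def _path(self, seq):
        return os.path.join(self.directory, f"{seq:012d}.json")

    def enqueue(self, message):
        """Persist a message before it is handed to the archiver."""
        seq = self.next_seq
        self.next_seq += 1
        tmp_path = self._path(seq) + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(message, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, self._path(seq))  # atomic publish of the entry
        return seq

    def pending(self):
        """Return (sequence, message) pairs that are not yet acknowledged."""
        entries = []
        for seq in self._sequence_numbers():
            with open(self._path(seq), encoding="utf-8") as handle:
                entries.append((seq, json.load(handle)))
        return entries

    def acknowledge(self, seq):
        """Drop an entry once the archiver confirms it has stored the message."""
        os.remove(self._path(seq))


def demo_spool():
    with tempfile.TemporaryDirectory() as directory:
        spool = ArchiveSpool(directory)
        first = spool.enqueue({"from": "alice", "body": "hello"})
        spool.enqueue({"from": "bob", "body": "hi"})
        # Simulate a restart: a new object sees the same pending entries.
        restarted = ArchiveSpool(directory)
        restarted.acknowledge(first)
        return [seq for seq, _ in restarted.pending()]
```

Because the sender only appends to the spool and never waits for the archiver, message switching is decoupled from database write latency.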
The above diagram shows how M-Link services a MAM request from an XMPP client. This works as follows:
- User formulates a request (e.g., search of archive) in an XMPP client, and sends the request as a MAM request over an established and authenticated XMPP connection from the XMPP client to M-Link.
- M-Link (Core) applies identity based access control to the MAM request and then generates an IPC request to M-Link (Archive) over an established TCP connection.
- M-Link (Archive) sends a response back to M-Link (Core).
- M-Link (Core) applies security label based checks to the response from M-Link (Archive) and returns the filtered result to the XMPP client using MAM.
This architecture enables MAM clients to query the M-Link archive. From the client standpoint, the M-Link server simply provides a MAM service.
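The request flow above can be sketched as three stages: an identity check before the archive is consulted, the archive lookup itself, and a security label filter applied to the results. All names here are hypothetical, as are the label values and their ordering. Real security label policies are considerably richer than a simple ranking, so this is only an outline of where each check sits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedMessage:
    sender: str
    recipient: str
    body: str
    label: str  # security label carried by the message


# Hypothetical label ordering, for illustration only.
LABEL_RANK = {"UNCLASSIFIED": 0, "RESTRICTED": 1, "SECRET": 2}


def identity_allows(user, archive_owner):
    """Identity check: in this sketch a user may only query their own archive."""
    return user == archive_owner


def archive_lookup(store, archive_owner):
    """Stand-in for the request sent to the archive process."""
    return [m for m in store if archive_owner in (m.sender, m.recipient)]


def label_filter(messages, clearance):
    """Drop results whose label is above the requesting user's clearance."""
    limit = LABEL_RANK[clearance]
    return [m for m in messages if LABEL_RANK[m.label] <= limit]


def serve_mam_request(store, user, archive_owner, clearance):
    if not identity_allows(user, archive_owner):
        raise PermissionError(f"{user} may not read the archive of {archive_owner}")
    return label_filter(archive_lookup(store, archive_owner), clearance)


store = [
    ArchivedMessage("alice", "bob", "lunch?", "UNCLASSIFIED"),
    ArchivedMessage("bob", "alice", "convoy route", "SECRET"),
    ArchivedMessage("carol", "dave", "status", "UNCLASSIFIED"),
]
visible = serve_mam_request(store, "alice", "alice", "RESTRICTED")
```

In this example alice sees only the unclassified message from her own conversations, and a request by alice for another user's archive is refused before the archive is queried.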
M-Link Archive Clustering
Archiving in a clustered environment requires each M-Link (Archive) process to hold information on all archived data, even though messages will typically only be handled by one cluster node. The reasons for this are:
- Archive searches are best served by just a single archive server to provide simplicity and good performance.
- Archive searches should work in the event of one cluster node failing.
This is addressed by M-Link in the following way:
- Each M-Link (Core) cluster node has its own M-Link (Archive) process, to which all messages handled by the node are archived. Archiving in a local process has good performance and resilience characteristics.
- M-Link (Core) servers are clustered to ensure service reliability using the M-Link clustering protocol. This shares routing information, and some messages. Messages are only shared between M-Link (Core) nodes when necessary.
- Each M-Link (Archive) process database will hold all messages for the entire cluster. This means that MAM queries to a cluster node will return results for the whole cluster.
- The M-Link (Archive) processes use an archive clustering protocol to achieve this, and to ensure that messages are fully replicated. This will deal with M-Link (Archive) and cluster node failures.
- In a cluster with multiple cluster nodes, the M-Link (Archive) process is directly connected to the M-Link (Archive) process on every other cluster node (i.e., the archive clustering has the same full mesh topology as the M-Link core server clustering).
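To make the full mesh topology concrete, the short sketch below lists the archive-to-archive links for a cluster, and the peers that must receive a copy of a message archived on one node. The node names and function names are hypothetical. A cluster of n nodes needs n(n-1)/2 links.

```python
from itertools import combinations


def full_mesh_links(nodes):
    """Return every unordered pair of nodes: one archive link per pair."""
    return list(combinations(sorted(nodes), 2))


def replication_targets(origin, nodes):
    """Peers that must receive a copy of a message archived on origin."""
    return [node for node in sorted(nodes) if node != origin]


cluster = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
links = full_mesh_links(cluster)
targets = replication_targets("node-b", cluster)
```

For the four node cluster shown, this gives six links, and each archived message is replicated to three peers.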
This paper has so far described Isode’s recommended M-Link clustering configuration. An alternative approach is also possible, illustrated below:
In this architecture, a single M-Link (Archive) process is used. It can be run co-resident with one M-Link cluster node or on a separate server. This architecture means that archive access has a single point of failure (the single M-Link (Archive) process).
Data Stored in M-Link Archive
M-Link archives all messages and selected state changes, including:
- 1:1 chat messages.
- MUC messages.
- MUC configuration changes (e.g. subject changes, room creation/destruction etc.).
- Pubsub messages, which can include any M-Link Statistics that are associated with a pubsub domain, as well as FDP templates and published forms (see below).
Essentially, messages handled by an M-Link server and key state changes are archived in wabac. It is likely that configuration options will be provided to enable selective archiving. There will be flexible search access to all messages.
IQ stanzas are not archived, as this would be a huge amount of data to archive with very little value to retain long term. However, where IQ stanzas lead to server actions (e.g., deletion of a MUC room) information on the action is archived, so the key changes are recorded.
An XMPP server switches messages, and so M-Link archives the messages switched. Some XMPP clients will group messages into “conversations”, although this is not a primary XMPP concept. Similarly, a tool retrieving messages from an archive could choose to group and display archived messages as conversations.
Three types of archived data are now considered in more detail.
MUC Membership
It is useful to know which users were in a MUC room at a given time. Because all leaves/joins are recorded and stored in the M-Link archive, this can be calculated.
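A minimal sketch of this calculation, assuming join and leave records are available as (timestamp, room, user, action) tuples, replays the events up to the time of interest. The record layout and the function name are assumptions for illustration, not the M-Link archive format.

```python
from datetime import datetime


def occupants_at(events, room, when):
    """Replay join/leave events for a room and return who was present at when."""
    present = set()
    for timestamp, event_room, user, action in sorted(events, key=lambda e: e[0]):
        if timestamp > when:
            break
        if event_room != room:
            continue
        if action == "join":
            present.add(user)
        elif action == "leave":
            present.discard(user)
    return present


events = [
    (datetime(2024, 3, 1, 9, 0), "ops", "alice", "join"),
    (datetime(2024, 3, 1, 9, 5), "ops", "bob", "join"),
    (datetime(2024, 3, 1, 9, 7), "planning", "carol", "join"),
    (datetime(2024, 3, 1, 9, 30), "ops", "alice", "leave"),
]
early = occupants_at(events, "ops", datetime(2024, 3, 1, 9, 10))
late = occupants_at(events, "ops", datetime(2024, 3, 1, 9, 45))
```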
Form Discovery & Publishing
Form Discovery & Publishing (FDP), standardized in XEP-0346, enables publishing and sharing of forms. The FDP technology and the M-Link implementation are described in the whitepaper [Military Forms using XMPP].
FDP stores form templates and published forms in PubSub. These are archived in the M-Link Archive as part of general PubSub archiving capability. The archive is structured so that forms can be searched for.
Statistics
M-Link uses PubSub to hold key statistics values used by M-Link management tools. A key benefit of this approach is that management tools can subscribe to the statistics and be updated as values change. This enables management tools to have the most up to date information and eliminates the need for polling.
Statistics stored in PubSub are archived by M-Link as part of the general purpose PubSub archiving. This archiving enables M-Link statistics management tools to access historical statistics data. The operation of these tools is described in more detail later in this paper.
Operator Access to M-Link Archive
End users will access information stored in M-Link archive using MAM and standard XMPP client access. This approach is also used for some Isode management tools (e.g., statistics) where requirements are met by MAM.
However, there are a number of required operator capabilities that cannot be achieved by MAM. To address this, direct protocol access to M-Link (Archive) is provided over HTTP or HTTPS. This Isode internal protocol makes use of JSON (JavaScript Object Notation) objects to communicate information. The protocol is documented in the M-Link Administration Manual, so that third party components can access the M-Link Archive directly if needed.
Direct access to M-Link (Archive) is authenticated, to ensure that only authorized users access the data, and will generally be operated over TLS using HTTPS. The M-Link archive can only be accessed directly by configured M-Link operators. MAM (indirect) access to M-Link (Archive) applies full identity and security label based access control to archive access. Users connecting directly to M-Link Archive will have access to all of the data stored in the archive. This means that care should be taken in granting direct access to M-Link Archive. It may also make sense to use IP level controls (e.g., in Firewall) to limit the locations from which the M-Link Archive can be directly accessed.
The rest of this section looks at operator archive capabilities available through M-Link Console.
Archive Configuration
The M-Link (Archive) server is created as part of the default M-Link setup. Archiving is configured from M-Link Console, using XEP-0050 Ad-Hoc Commands sent to M-Link (Core). This configuration is shown in the screenshot above. This enables configuration of the archive service, and selection of the archive functionality desired on a per domain basis.
Search and Search Refinement
M-Link Console provides operator search access, which goes directly to M-Link (Archive) and thus has access to all search data. The screenshot above shows how flexible searching can be achieved, looking for data by date range and location.
Once an initial search result is achieved, this search can be refined in two ways:
- Searches can be performed within the results returned, to help find a specific record.
- The operator can “zoom” on a specific result, to show the messages before and after. This allows the operator to use search to find a message, and then to view it “in context”.
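The two refinement steps can be sketched as a search within an existing result set followed by a “zoom” that returns the neighbouring messages. The sketch assumes the messages of one conversation are held in time order in a list. The function names and the default context sizes are illustrative assumptions.

```python
def search_within(results, text):
    """Refine a result set: return the indexes of results containing text."""
    needle = text.lower()
    return [i for i, message in enumerate(results) if needle in message.lower()]


def zoom(messages, index, before=2, after=2):
    """Return the selected message with the messages before and after it."""
    start = max(0, index - before)
    return messages[start:index + after + 1]


conversation = [
    "morning all",
    "convoy leaves at 0600",
    "route is via the north bridge",
    "bridge reported closed",
    "switching to the south route",
    "acknowledged",
]
hits = search_within(conversation, "closed")
context = zoom(conversation, hits[0], before=1, after=1)
```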
Recording Search Results
In operations exchanging mission critical data, it is sometimes desirable for the operator to record search results for the long term record. M-Link Console supports this by enabling search results to be saved in PDF/A format. PDF/A is a PDF profile specifically designed for long term archive.
In order to enable M-Link deployments with customised reports, property files have been provided to configure PDF metadata, cover page and security label in header/footer.
Redact
In secure deployments, there will be occasions when messages sent to rooms must be removed for security reasons. M-Link Console provides a redaction capability, where the operator may select a message and redact it. Information on the message will remain in the archive, but the text of the message will be replaced with a string that indicates that the original message was redacted.
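The effect of redaction can be sketched as follows: the record and its metadata are kept, and only the message text is replaced. The record fields, the redacted flag and the marker text are assumptions made for illustration, and the string used by M-Link may differ.

```python
from dataclasses import dataclass, replace

# Hypothetical marker text.
REDACTION_MARKER = "[This message was redacted by an operator]"


@dataclass(frozen=True)
class Record:
    message_id: str
    sender: str
    room: str
    sent_at: str
    body: str
    redacted: bool = False


def redact(archive, message_id):
    """Return the archive with one message body replaced by the marker."""
    return [
        replace(record, body=REDACTION_MARKER, redacted=True)
        if record.message_id == message_id
        else record
        for record in archive
    ]


archive = [
    Record("m1", "alice", "ops", "2024-03-01T09:00:00Z", "grid ref 1234"),
    Record("m2", "bob", "ops", "2024-03-01T09:01:00Z", "acknowledged"),
]
after_redaction = redact(archive, "m1")
```

The sender, room and time of the redacted message remain searchable, so the archive still shows that a message was sent.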
Other Management Functions
There are a number of additional management capabilities, which are offered in two ways. Most of them are available from M-Link Console, which suits ad hoc use. They are also available as a command line tool, which can be used in scripts that run automatically at intervals (e.g., each night). The tool uses the same HTTP/HTTPS protocol as M-Link Console and works in exactly the same way.
Import/Export
A core capability of M-Link (Archive) is Import and Export. This is available from M-Link Console and as a separate tool. The basic capability is:
- Export will write Archive data to an XML file.
- Import will load data from an XML file.
The format of the XML import/export file is documented in the M-Link Manual, to facilitate integration with other systems. In order to reliably handle very large imports and exports, export works in the following way:
- Client (M-Link Console or Tool) requests an export.
- M-Link Archive generates file on server.
- M-Link Archive reports to client progress on generating the file, and when file creation is complete.
- Client downloads the XML file.
Import works in the opposite way, with the client starting by uploading an XML file for import.
Import will always import the entire file. For export, the data to be exported can either be the full archive or a selected date range.
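The export sequence above (request, server side generation, progress reports, download) can be sketched with an in-memory stand-in for the server. The class and function names are hypothetical, and so are the XML element names: the real file format is the one documented in the M-Link Manual.

```python
import xml.etree.ElementTree as ET


class ExportJob:
    """Stand-in for a server side export that is generated in chunks."""

    def __init__(self, messages, chunk_size=2):
        self.messages = list(messages)
        self.chunk_size = chunk_size
        self.written = 0
        self.root = ET.Element("archive")  # hypothetical element name

    @property
    def complete(self):
        return self.written >= len(self.messages)

    def step(self):
        """Write the next chunk of messages to the export document."""
        chunk = self.messages[self.written:self.written + self.chunk_size]
        for sender, body in chunk:
            item = ET.SubElement(self.root, "message", {"from": sender})
            item.text = body
        self.written += len(chunk)

    def progress(self):
        return (self.written, len(self.messages))

    def download(self):
        if not self.complete:
            raise RuntimeError("export file is not ready yet")
        return ET.tostring(self.root, encoding="unicode")


def run_export(messages):
    job = ExportJob(messages)  # client requests an export
    progress_log = []
    while not job.complete:  # server generates the file and reports progress
        job.step()
        progress_log.append(job.progress())
    return job.download(), progress_log  # client downloads the finished file


sample = [("alice", "one"), ("bob", "two"), ("alice", "three"),
          ("carol", "four"), ("bob", "five")]
export_xml, progress_log = run_export(sample)
```

Separating generation from download means the client never holds a long running request open while a very large file is being built.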
Backup
Two mechanisms already described may be used for M-Link (Archive) backup:
- Use of a cluster node. Because the archive service is fully clustered, archive data will be held on other cluster nodes in the event of node failure. This may provide sufficient backup.
- The XML import/export may be used for backup.
M-Link also provides a special backup tool. This works in exactly the same way as the export tool, except that only full backup is offered and the format is one optimized for backup. This is Isode’s recommended approach to backup.
Migration
The XML import/export capability provides a flexible approach for migration to and from M-Link XMPP archives, as third party archive formats can be mapped to and from the XML file format.
M-Link provides an archive mechanism to local XML files, which was the only archive mechanism in versions of M-Link prior to Release 16.3. This option is available in M-Link R16.3, although we anticipate that customers will generally use the M-Link archive.
Isode provides a tool to migrate from these older archives to the new M-Link (Archive) database. This functions by mapping the old archives to the XML import format and then importing the XML.
Data Expiry
A tool is provided in M-Link to expire data from the archive, which operates over the JSON/HTTP protocol. This is available from M-Link Console and as a tool which can be run automatically. This tool is used to remove data older than a specific age.
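The expiry operation amounts to deleting records older than a cutoff. The sketch below shows this against an in-memory SQLite database. The table and column names are invented for illustration and are not the wabac schema, and the M-Link tool itself works through the JSON/HTTP protocol rather than by direct database access.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_older_than(conn, max_age, now):
    """Delete messages older than max_age and return how many were removed."""
    cutoff = int((now - max_age).timestamp())
    with conn:  # commit on success, roll back on error
        cursor = conn.execute("DELETE FROM messages WHERE sent_at < ?", (cutoff,))
    return cursor.rowcount


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
# Hypothetical schema: sent_at holds a Unix timestamp in seconds.
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, sent_at INTEGER, body TEXT)")
for age_days, body in ((10, "recent"), (100, "old"), (400, "very old")):
    sent_at = int((now - timedelta(days=age_days)).timestamp())
    conn.execute("INSERT INTO messages (sent_at, body) VALUES (?, ?)", (sent_at, body))
conn.commit()

removed = expire_older_than(conn, timedelta(days=90), now)
remaining = [row[0] for row in conn.execute("SELECT body FROM messages")]
```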
Standalone Operation for Long Term Archive
The description so far has focused on archiving as part of an operational service. The “Long Term Archive” requirement is addressed by keeping data in an M-Link Server and using the M-Link search tools to access this data. This enables the full functionality of M-Link archive access, which is not available in the “Very Long Term Archive” approach described next.
Where it is desirable to hold only a limited archive in the operational server (e.g., the last three months of data), a separate M-Link server can be established to hold the long term archive. This would allow archive access independent of the operational service. This service would be managed as follows, on a periodic basis:
- Export data from operational service to XML file.
- Import the XML file to the long term archive server.
- Expire old data from the operational service.
This process can be achieved manually using M-Link Console or automated using the M-Link provided tools.
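The periodic cycle can be sketched with two in-memory lists standing in for the operational and long term archives. The record layout and function name are assumptions. The ordering is the point of the sketch: data is expired from the operational service only after it has been imported into the long term server, and re-running the cycle does not import duplicates.

```python
def run_archive_cycle(operational, long_term, cutoff):
    """Export records older than cutoff, import them, then expire them."""
    # 1. Export the selected date range from the operational service.
    exported = [m for m in operational if m["sent_at"] < cutoff]
    # 2. Import into the long term archive, skipping records already present.
    known_ids = {m["id"] for m in long_term}
    long_term.extend(m for m in exported if m["id"] not in known_ids)
    # 3. Expire the old data from the operational service.
    operational[:] = [m for m in operational if m["sent_at"] >= cutoff]
    return len(exported)


operational = [
    {"id": "m1", "sent_at": "2024-01-10", "body": "old"},
    {"id": "m2", "sent_at": "2024-02-20", "body": "older than cutoff"},
    {"id": "m3", "sent_at": "2024-05-05", "body": "recent"},
]
long_term = [{"id": "m1", "sent_at": "2024-01-10", "body": "old"}]
moved = run_archive_cycle(operational, long_term, cutoff="2024-03-01")
```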
Very Long Term Archiving
Very long term archives make use of PDF/A files. It is desirable to have the archive files reasonably large, so that an archivist examining the files in the future does not have too many files to look at, and can make use of PDF search capabilities. This needs to be balanced against files being too large.
The approach taken with the M-Link PDF/A archive is to archive all data for a configurable period. This might be every day for a busy service and every week for a less busy service. This allows searching across the entire archive. Indexes are provided by User and MUC to enable navigation to specific parts of the archive, which may be more convenient in some cases.
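The grouping described above can be sketched as follows: messages are assigned to fixed length periods, one archive file per period, and a per-user index is built for each batch. The function names, the record layout and the choice of reference date are assumptions. Producing the PDF/A file itself is outside the scope of the sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta

REFERENCE = datetime(2024, 1, 1)  # hypothetical start of the first period


def period_start(timestamp, period_days):
    """Return the first day of the period that contains timestamp."""
    periods_elapsed = (timestamp - REFERENCE).days // period_days
    return (REFERENCE + timedelta(days=periods_elapsed * period_days)).date()


def batch_by_period(messages, period_days):
    """Group (timestamp, user, body) records into one batch per period."""
    batches = defaultdict(list)
    for timestamp, user, body in messages:
        batches[period_start(timestamp, period_days)].append((timestamp, user, body))
    return dict(batches)


def index_by_user(batch):
    """Map each user to the positions of their messages within a batch."""
    index = defaultdict(list)
    for position, (_, user, _) in enumerate(batch):
        index[user].append(position)
    return dict(index)


messages = [
    (datetime(2024, 1, 1, 9, 0), "alice", "a"),
    (datetime(2024, 1, 1, 17, 0), "bob", "b"),
    (datetime(2024, 1, 2, 8, 0), "alice", "c"),
    (datetime(2024, 1, 9, 8, 0), "carol", "d"),
]
daily = batch_by_period(messages, period_days=1)
weekly = batch_by_period(messages, period_days=7)
first_week_index = index_by_user(weekly[REFERENCE.date()])
```

A busy service would use the daily grouping and a quieter one the weekly grouping, which trades the number of files against the size of each file.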
PDF/A very long term archive files can be generated from M-Link Console which will be useful for occasional very long term archive generation. The PDF/A files can also be generated from a command line tool, which can be used for nightly update of the very long term archive.
The PDF/A archive layout can be customised using property files. This will allow customization of things such as:
- Security Label in header and footer.
- Cover sheet material, including:
  - Icon
  - Title
  - Informative text in the form of a table
- PDF metadata
Automated Archive
For one-off archive operations, the M-Link Console GUI is ideal. It will often make sense to incrementally generate PDF/A very long term archives and/or XML files to enable off-site access to archives. This can be done with a script that runs daily, or on another regular schedule, using the M-Link command line tools.
Statistics Archive
The M-Link archive is used to hold statistics history recorded in PubSub nodes. This section describes three M-Link tools that make use of this archived history. They all use MAM to access history.
M-Link Web Monitoring
M-Link provides a Web Application to monitor M-Link that connects over BOSH. This allows for lightweight monitoring of M-Link without the need to install any software, to complement the general purpose management and monitoring capability provided by M-Link Console. It provides live data to give a concise summary of the operational state of a single M-Link service. We plan to also include historical statistics from the archive in a future version.
M-Link Console Statistics
M-Link Console provides flexible monitoring of a large number of operational parameters, for one or more M-Link services. M-Link Console can show live statistics, which will include appropriate historical statistics retrieved from M-Link (Archive) using MAM. M-Link Console can also show historical statistics for a selected period, drawing all statistics from the archive.
Statistics Reports
The historical statistics described above can also be used to generate reports in PDF/A format. This is an option in M-Link Console. The security label and cover data in the report can be configured using an XML stylesheet.
Conclusions
This paper has looked at the extensive archiving capabilities in Isode’s M-Link, in particular the user/operator search and archiving facilities in M-Link Console and the new MAM web application. The support of very long term archiving and XEP-0004 form search makes M-Link an ideal solution for military deployments, where archives may be legally required for a long period of time.
Moving forward, Isode plans to add MAM support to the Swift XMPP client, as well as include historical statistics from the archive in the M-Link MAM web application.