Redundancy and Reliability

Luna HSMs have two available configuration models for redundancy and reliability:

>High-Availability Groups

>Clusters

High-Availability Groups

Luna HSMs can provide scalability and redundancy for cryptographic applications that are critical to your organization. For applications that require continuous, uninterruptible uptime, the Luna HSM Client allows you to combine application partitions on multiple HSMs into a single logical group, known as a High-Availability (HA) group.

An HA group allows your client application to access cryptographic services as long as one member HSM is functional and network-connected. This allows you to perform maintenance on any individual member without ever pausing your application, and provides redundancy in the case of individual failures. Cryptographic requests are distributed across all active group members, enabling a performance gain for each member added. Cryptographic objects are replicated across the entire group, so HA can also be used to keep a current, automatic, remote backup of the group contents.

HA functionality is handled by the Luna HSM Client software. The individual partitions have no way to know they are configured in an HA group, so you can configure HA on a per-application basis. The way you group your HSMs depends on your circumstances and desired performance.

Performance

For repetitive operations (for example, many signings using the same key), an HA group provides linear performance gains as group members are added. The best approach is to maintain an HA group at a size that best balances application server capability and the expected loads, with an additional unit providing capacity for bursts of traffic.

Load Balancing

Cryptographic requests sent to the HA group's virtual slot are load-balanced across all active members of the HA group. The load-balancing algorithm sends requests for cryptographic operations to the least busy partition in the HA group. This scheme accounts for operations of variable length, ensuring that queues are balanced even when some partitions are assigned very long operations. When an application requests a repeated set of operations, this method works. When the pattern is interrupted, however, the request type becomes relevant, as follows:

>Single-part (stateless) cryptographic operations are load-balanced.

>Multi-part (stateful) cryptographic operations are load-balanced.

>Multi-part (stateful) information retrieval requests are not load-balanced. In this case, the cost of distributing the requests to different HA group members is generally greater than the benefit. For this reason, multi-part information retrieval requests are all targeted at one member.

> Key management requests are not load-balanced. Operations affecting the state of stored keys (creation, deletion) are performed on a single HA member, and the result is then replicated to the rest of the HA group.

Key Replication

Objects (session or token) are replicated immediately to all members in an HA group when they are generated in the virtual HA slot. Similarly, deletion of objects (session or token) from the virtual HA slot is immediately replicated across all group members. Therefore, when an application creates a key on the virtual HA slot, the HA library automatically replicates the key across all group members before reporting back to the application. Keys are created on one member partition and replicated to the other members. If a member fails during this process, the HA group reattempts key replication to that member until it recovers, or failover attempts time out. Once the key exists on all active members of the HA group, a success code is returned to the application.

NOTE   If you are using Luna HSM Client 10.4.0 or newer and are setting up an HA group with a mix of FIPS and non-FIPS partitions as members, objects will not replicate across all HSMs in the group in the following cases:

>If you have set a non-FIPS primary, a FIPS secondary, and created a non-FIPS approved key on the group, the key will not replicate to the FIPS secondary. No error is returned when this occurs.

>If you synchronize group members with the hagroup synchronize LunaCM command, any non-FIPS keys will fail to replicate to the FIPS member(s). An error is returned when this occurs, but lunaCM synchronizes everything else.

NOTE    If your application bypasses the virtual slot and creates or deletes directly in a physical member slot, the action occurs only in that single physical slot, and can be overturned by the next synchronization operation. For this reason we generally advise to enable HA-only, unless you have specific reason to access individual physical slots, and are prepared (in your application) to perform the necessary housekeeping.

Key replication, for pre-firmware-7.7.0 HSM partitions and for V0 partitions, uses the Luna cloning protocol, which provides mutual authentication, confidentiality, and integrity for each object that is copied from one partition to another. Therefore, prior to Luna HSM Firmware 7.8.0, all HA group member partitions must be initialized with the same cloning domain.

Key replication, for Luna HSM Firmware 7.8.0 (and newer) HSM partitions and for V0 partitions, and Luna HSM Client 10.5.0 (and newer), becomes more versatile with Extended Domain Management, as each member partition can have as many as three cloning/security domains. It becomes possible to easily mix password-authenticated and multi-factor (PED) authenticated partitions in HA groups. Any member must have at least one of its domains in common with the current primary member. [For reasons of redundancy and overlap, we recommend that you not create (say) a 4-member group where the primary has domains A, B, C, and the three secondary members include one member with domain A, one member with domain B, and one member with domain C, where no other domains belong to the group -- such a group could function only until the primary failed/went-offline, at which point the next primary would have no domain peers with which to synchronize. Therefore, consider redundancy overlap when using Extended Domain Management with HA group members.


Key replication for V1 partitions uses the Luna cloning protocol to ensure that all HA group members have the same SMK, and uses SKS to export a key originating at one member and to import and decrypt that key (using the common SMK) on each other member in the group. Again, all HA group member partitions must be initialized with the same cloning domain in order that the common SMK can be available on every member.

Failover

When any active HA group member fails, a failover event occurs – the affected partition is dropped from the list of available HA group members, and all operations that were pending on the failed partition are transparently rescheduled on the remaining member partitions. The Luna HSM Client continuously monitors the health of member partitions at two levels:

> network connectivity – disruption of the network connection causes a failover event after a 20-second timeout.

>command completion – any command that is not executed within 20 seconds causes a failover event.

As long as one HA group member remains functional, cryptographic service is maintained to an application no matter how many other group members fail.

Recovery

Recovery of a failed HA group member is designed to be automatic in as many cases as possible. You can configure your auto-recovery settings to require as much manual intervention as is convenient for you and your organization. In either an automated or manual recovery process, there is no need to restart your application. As part of the recovery process:

>Any cryptographic objects created while the member was offline are automatically replicated to the recovered partition.

>The recovered partition becomes available for its share of load-balanced cryptographic operations.

Standby Members

After you add member partitions to an HA group, you can designate some as standby members. Cryptographic objects are replicated on all members of the HA group, including standby members, but standby members do not perform any cryptographic operations unless all the active members go offline. In this event, all standby members are immediately promoted to active service, and operations are load-balanced across them. This provides an extra layer of assurance against a service blackout for your application.

Mixed-Version HA Groups

Generally, Thales recommends using HSMs with the same software/firmware in HA groups; different versions have different capabilities, and a mixed HA group is limited to those functions that are common to the versions involved. A mixed-version HA group may have access to fewer cryptographic mechanisms, or have different restrictions in FIPS mode. However, HA groups containing both Luna 6 and 7 partitions and Luna Cloud HSM services are supported. This mixed-version configuration is useful for migrating keys to a new Luna 7 HSM or the cloud, or to gradually upgrade your production environment from Luna 6 to Luna 7.

Clusters

Luna Network HSM 7 now allows you to store your cryptographic objects in an encrypted cluster within the appliance memory. This process uses Scalable Key Storage to encrypt the cluster and the SMK is shared with all member HSMs. The cluster contains keyrings, which are analogous to application partitions and can be accessed by a client in much the same way, by connecting to any member appliance. Keys are retrieved from the cluster, decrypted within the secure confines of the HSM, and used by the HSM for cryptographic operations. This configuration allows you to store many more keys than you can normally store within HSM partitions. The management of backup and restore operations is greatly simplified; the HSM administrator can back up the full content of a cluster, at scheduled intervals or on demand.

A cluster can consist of one Luna Network HSM 7 member appliance, or multiple appliances that share the contents of the cluster. Adding multiple members to a cluster improves performance, and provides redundancy and failover for your client applications. Thales recommends a maximum of 4 members per cluster.

Up to 3500 keyrings can be created on the cluster, and each keyring can contain up to 256 objects. Each Luna HSM Client can manage up to 3500 keyrings, which can be spread across multiple clusters.

Customized Load-Balancing

Luna Network HSM 7s within a cluster can be added to an affinity group. Operations from a connected client application are load-balanced between members of the same group only. This allows you to use the other members, which can be at a remote location with greater latency, as backup or standby members for a specific client. If all members of a client's preferred group are unavailable, operations then fail over to other members of the cluster. The state of keyrings and objects stored on them is always synchronized across all members of the cluster, regardless of group. You can create up to 64 affinity groups in a cluster.