That Cluster Guy

Savaged by Softdog, a Cautionary Tale

2019-10-11T13:55:00+11:00

Hardware is imperfect, and software contains bugs. When node level failures occur, the work required from the cluster does not decrease - affected workloads need to be restarted, putting additional stress on surviving peers and making it important to recover the lost capacity.

Additionally, some of workloads may require at-most-one semantics. Failures affecting these kind of workloads risk data loss and/or corruption if ”lost” nodes remain at least partially functional. For this reason the system needs to know that the node has reached a safe state before initiating recovery of the workload.

The process of putting the node into a safe state is called fencing, and the HA community generally prefers power based methods because they provide the best chance of also recovering capacity without human involvement.

There are two categories of fencing which I will call direct and indirect but could equally be called active and passive.

Direct methods involve action on the part of surviving peers, such interacting with an IPMI or iLO device, whereas indirect methods rely on the failed node to somehow recognise it is in an unhealthy state and take steps to enter a safe state on its own.

The most common form of indirect fencing is the use of a watchdog. The watchdog’s timer is reset every N seconds unless quorum is lost or the part of the software stack fails. If the timer (usually some multiple of N) expires, then the the watchdog will panic (not shutdown) the machine.

When done right, watchdogs can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.

However when relying on indirect fencing mechanisms, it is important to recognise that in the absence of out-of-band communication such as disk based heartbeats, surviving peers have absolutely no ability to validate that the lost node ever reaches a safe state, surviving peers are making an assumption when they start recovery. There is a risk it didn’t happen as planned and the cost of getting it wrong is data corruption and/or loss.

Nothing is without risk though. Someone with an overdeveloped sense of paranoia and an infinite budget could buy all of Amazon, plus Microsoft and Google for redundancy, to host a static website - and still be undone by an asteroid. The goal of HA is not to eliminate risk, but reduce it to an acceptable level. What constitutes an acceptable risk varies person-to-person, project-to-project, and company-to-company, however as a community we encourage people to start by eliminating single points of failure (SPoF).

In the absence of direct fencing mechanisms, we like hardware based watchdogs because as a self-contained device they can panic the machine without involvement from the host OS. If the watchdog fails, the node is still healthy and data loss can only occur through failure of additional nodes. In the event of a power outage, they also loose power but the node is already safe. A network failure is no longer a SPoF and would require a software bug (incorrect quorum calculations for example) in order to present a problem.

There is one last class of failures, software bugs, that are the primary concern HA and kernel experts whenever Softdog is put forward in situations where already purchased cluster machines lack both power management and watchdog hardware.

Softdog malfunctions originating in software can take two forms - resetting a machine when it should not have (false positive), and not resetting a machine when it should have (false negative). False positives will reduce overall availability due to repeated failovers, but the integrity of the system and its data will remain intact.

More concerning is the possibility for a single software bug to both cause a node to become unavailable and prevent softdog from recovering the system. One option for this is a bug in a device or device driver, such as a tight loop or bad spinlock usage, that causes the system bus to lock up. In such a scenario the watchdog timer would expire, but the softdog would not be able to trigger a reboot. In this state it is not be possible to recover the cluster’s capacity without human intervention, and in theory the entire machine is in a state that prevents it from being able to receive or act on client requests - although perhaps not always (unfortunately the interesting parts of the bug are private).

If the customer needs guaranteed reboot, they should install a hardware watchdog.

— Mikulas Patocka (Red Hat kernel engineer)

The greatest danger of softdog, is that most of the time it appears to work just fine. For months or years it will reboot your machines in response to network and software outages, only to fail you when just the wrong conditions are met.

Imagine a pointer error, the kind that corrupts the kernel’s internal structures and causes kernel panics. Rarely triggered, but one day you get unlucky and the area of memory that gets scribbled on includes the softdog.

Just like all the other times it causes the machine to misbehave, but the surviving peers detect it, wait a minute or two, and then begin recovery. Application services are started, volumes are mounted, database replicas are promoted to master, VIPs are brought up, and requests start being processed.

However unlike all the other times, the failed peer is still active because the softdog has been corrupted, the application services remain responsive and nothing has removed VIPs or demoted masters.

At this point, your best case scenario is that database and storage replication is broken. Requests from some clients will go to the failed node, and some will go to its replacement. Both will succeed, volumes and databases will be updated independently of what happened on the other peer. Reads will start to return stale or otherwise inaccurate data, and incorrect decisions will be made based on them. No transactions will be lost, however the longer the split remains, the further the datasets will drift apart and the more work it will be to reconcile them by hand once the situation is discovered.

Things get worse if replication doesn’t break. Now you have the prospect of uncoordinated parallel access to your datasets. Even if database locking is still somehow working, eventually those changes are persisted to disk and there is nothing to prevent both sides from writing different versions of the same backing file due to non-overlapping database updates.

Depending on the timing and scope of the updates, you could get:

only whole file copies from the second writer and loose transactions from the first,
whole file copies from a mixture of hosts, leading to a corrupted on-disk representation,
files which contain a mixture of bits from both hosts, also leading to a corrupted on-disk representation, or
all of the above.

Ironically an admin’s first instinct, to restart the node or database and see if that fixes the situation, might instead wipe out the only remaining consistent copy of their data (asuming the entire database fits in memory). At which point all transactions since the previous backup are lost.

To mitigate this situation, you would either need very frequent backups, or add a SCSI based fencing mechanism to ensure exclusive access to shared storage, and a network based mechanism to prevent requests from reaching the failed peer.

Or you could just use a hardware watchdog (even better, try a network power switch).

A New Fencing Mechanism (TBD)

2018-03-07T13:11:00+11:00

Protecting Database Centric Applications

In the same way that some application require the ability to persist records to disk, for some applications the loss of access to the database means game over - more so than disconnection from the storage.

Cinder-volume is one such application and as it moves towards an active/active model, it is important that a failure in one peer does not represent a SPoF. In the Cinder architecture, the API server has no way to know if the cinder- volume process is fully functional - so they will still recieve new requests to execute.

A cinder-volume process that has lost access to the storage will naturally be unable to complete requests. Worse though is loosing access to the database, as this will means the result of an action cannot be recorded.

For some operations this is ok, if wasteful, because the operation will fail and be retried. Deletion of something that was already deleted is usually treated as a success and re-attempted operations for creating volume will return a new volume. However performing the same resize operation twice is highly problematic since the recorded old size no longer matches the actual size.

Even the safe operations may never complete because the bad cinder-volume process may end up being asked to perform the cleanup operations from its own failures, which would result in additional failures.

Additionally, despite not being recommended, some Cinder drivers make use of locking. For those drivers it is just as crucial that any locks held by a faulty or hung peer can be recovered within a finite period of time. Hence the need for fencing.

Since power-based fencing is so dependant on node hardware and there is always some kind of storage involved, the idea of leveraging the SBD[1] ( Storage Based Death ) project’s capabilities to do disk based heartbeating and poison-pills is attractive. When combined with a hardware watchdog, it is an extremely reliable way to ensure safe access to shared resources.

However in Cinder’s case, not all vendors can provide raw access to a small block device on the storage. Additionally, it is really access to the database that needs protecting not the storage. So while useful, it is still relatively easy to construct scenarios that would defeat SBD.

A New Type of Death

Where SBD uses storage APIs to protect applications persisting data to disk, we could also have one based on SQL calls that did the same for Cinder-volume and other database centric applications.

I therefor propose TBD - “Table Based Death” (or “To Be Decided” depending on how you’re wired).

Instead of heartbeating to a designated slot on a block device, the slots become rows in a small table in the database that this new daemon would interact with via SQL.

When a peer is connected to the database, a cluster manager like Pacemaker can use a poison pill to fence the peer in the event of a network, node, or resource level failure. Should the peer ever loose quorum or its connection to the database, surviving peers can assume with a degree of confidence that it will self terminate via the watchdog after a known interval.

The desired behaviour can be derived from the following properties:

Quorum is required to write poison pills into a peer’s slot
A peer that finds a poison pill in its slot triggers its watchdog and reboots
A peer that looses connection to the database won’t be able to write status information to its slot which will trigger the watchdog
A peer that looses connection to the database won’t be able to write a poison pill into another peer’s slot
If the underlying database looses too many peers and reverts to read-only, we won’t be able to write to our slot which triggers the watchdog
When a peer that looses connection to its peers, the survivors would maintain quorum(1) and write a poison pill to the lost node (1) ensuring the peer will terminate due to scenario (2) or (3)

If N seconds is the worst case time a peer would need to either notice a poison pill, or disconnection from the database, and trigger the watchdog. Then we can arrange for services to be recovered after some multiple of N has elasped in the same way that Pacemaker does for SBD.

While TBD would be a valuable addition to a traditional cluster architecture, it is also concievable that it could be useful in a stand-alone configuration. Consideration should therefor be given during the design phase as to how best consume membership, quorum, and fencing requests from multiple sources - not just a particular application or cluster manager.

Limitations

Just as in the SBD architecture, we need TBD to be configured to use the same persistent store (database) as is being consumed by the applications it is protecting. This is crucial as it means the same criteria that enables the application to function, also results in the node self-terminating if it cannot be satisfied.

However for security reasons, the table would ideally live in a different namespace and with different access permissions.

It is also important to note that significant design challenges would need to be faced in order to protect applications managed by the same cluster that was providing the highly available database being consumed by TBD. Consideration would particularly need to be given to the behaviour of TBD and the applications it was protecting during shudown and cold-start scenarios. Care would need to be taken in order to avoid unnecessary self-fencing operations and that failure responses are not impacted by correctly handling these scenarios.

Footnotes

[1] SBD lives under the ClusterLabs banner but can operate without a traditional corosync/pacemaker stack.

A New Thing

2018-02-16T14:32:00+11:00

I made a new thing.

If you’re interested in Kubernetes and/or managing replicated applications, such as Galera, then you might also be interested in an operator that allows this class of applications to be managed natively by Kubernetes.

There is plenty to read on why the operator exists, how replication is managed and the steps to install it if you’re interested in trying it out.

There is also a screencast that demonstrates the major concepts:

Feedback welcome.

Two Nodes - The Devil is in the Details

2018-02-16T10:52:00+11:00

tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but while it’s possible to construct good ones, most will have subtle failure modes

The first step towards creating any HA system is to look for and try to eliminate single points of failure, often abbreviated as SPoF.

It is impossible to eliminate all risk of downtime and especially when one considers the additional complexity that comes with introducing additional redunancy, concentrating on single (rather than chains of related and therefor decreasingly probable) points of failure is widely accepted as a suitable compromise.

The natural starting point then is to have more than one node. However before the system can move services to the surviving node after a failure, in general, it needs to be sure that they are not still active elsewhere.

So not only are we looking for SPoFs, but we are also looking to balance risks and consequences and the calculus will be different for every deployment [1]

There is no downside if a failure causes both members of a two node cluster to serve up the same static website. However its a very different story if it results in both sides independently managing a shared job queue or providing uncoordinated write access to a replicated database or shared filesystem.

So in order to prevent a single node failure from corrupting your data or blocking recovery, we rely on something called fencing.

Fencing

At it s heart, fencing turns a question Can our peer cause data corruption? into an answer no by isolating it both from incoming requests and persistent storage. The most common approach to fencing is to power off failed nodes.

There are two categories of fencing which I will call direct and indirect but could equally be called active and passive. Direct methods involve action on the part of surviving peers, such interacting with an IPMI or iLO device, whereas indirect relies on the failed node to somehow recognise it is in an unhealthy state (or is at least preventing remaining members from recovering) and signal a hardware watchdog to panic the machine.

Quorum helps in both these scenarios.

Direct Fencing

In the case of direct fencing, we can use it to prevent fencing races when the network fails. By including the concept of quorum, there is enough information in the system (even without connectivity to their peers) for nodes to automatically know whether they should initiate fencing and/or recovery.

Without quorum, both sides of a network split will rightly assume the other is dead and rush to fence the other. In the worst case, both sides succeed leaving the entire cluster offline. The next worse is a death match , a never ending cycle of nodes coming up, not seeing their peers, rebooting them and initiating recovery only to be rebooted when their peer goes through the same logic.

The problem with fencing is that the most commonly used devices become inaccessible due to the same failure events we want to use them to recover from. Most IPMI and iLO cards both loose power with the hosts they control and by default use the same network that is causing the peers to believe the others are offline.

Sadly the intricacies of IPMI and iLo devices is rarely a consideration at the point hardware is being purchased.

Indirect Fencing

Quorum is also crucial for driving indirect fencing and, when done right, can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.

In such a setup, the watchdog’s timer is reset every N seconds unless quorum is lost. If the timer (usually some multiple of N) expires, then the machine performs an ungraceful power off (not shutdown).

This is very effective but without quorum to drive it, there is insufficient information from within the cluster to determine the difference between a network outage and the failure of your peer. The reason this matters is that without a way to differentiate between the two cases, you are forced to choose a single behaviour mode for both.

The problem with choosing a single response is that there is no course of action that both maximises availability and prevents corruption.

If you choose to assume the peer is alive but it actually failed, then the cluster has unnecessarily stopped services.
If you choose to assume the peer is dead but it was just a network outage, then the best case scenario is that you have signed up for some manual reconciliation of the resulting datasets.

No matter what heuristics you use, it is trivial to construct a single failure that either leaves both sides running or where the cluster unnecessarily shuts down the surviving peer(s). Taking quorum away really does deprive the cluster of one of the most powerful tools in its arsenal.

Given no other alternative, the best approach is normally to sacrificing availability. Making corrupted data highly available does no-one any good and manually reconciling diverant datasets is no fun either.

Quorum

Quorum sounds great right?

The only drawback is that in order to have it in a cluster with N members, you need to be able to see N/2 + 1 of your peers. Which is impossible in a two node cluster after one node has failed.

Which finally brings us to the fundamental issue with two-nodes:

quorum does not make sense in two node clusters, and

without it there is no way to reliably determine a course of action that both maximises availability and prevents corruption

Even in a system of two nodes connected by a crossover cable, there is no way to conclusively differentiate between a network outage and a failure of the other node. Unplugging one end (who’s likelihood is surely proportional to the distance between the nodes) would be enough to invalidate any assumption that link health equals peer node health.

Making Two Nodes Work

Sometimes the client can’t or wont make the additional purchase of a third node and we need to look for alternatives.

Option 1 - Add a Backup Fencing Method

A node’s iLO or IPMI device represents a SPoF because, by definition, if it fails the survivors cannot use it to put the node into a safe state. In a cluster of 3 nodes or more, we can mitigate this a quorum calculation and a hardware watchdog (an indirect fencing mechanism as previously discussed). In a two node case we must instead use network power switches (aka. power distribution units or PDUs).

After a failure, the survivor first attempts to contact the primary (the built-in iLO or IPMI) fencing device. If that succeeds, recovery proceeds as normal. Only if the iLO/IPMI device fails is the PDU invoked and assuming it succeeds, recovery can again continue.

Be sure to place the PDU on a different network to the cluster traffic, otherwise a single network failure will prevent access to both fencing devices and block service recovery.

You might be wondering at this point… doesn’t the PDU represent a single point of failure? To which the answer is “definitely“.

If that risk concerns you, and you would not be alone, connect both peers to two PDUs and tell your cluster software to use both when powering peers on and off. Now the cluster remains active if one PDU dies, and would require a second fencing failure of either the other PDU or an IPMI device in order to block recovery.

Option 2 - Add an Arbitrator

In some scenarios, although a backup fencing method would be technically possible, it is politically challenging. Many companies like to have a degree of separation between the admin and application folks, and security conscious network admins are not always enthusiastic about handing over the usernames and passwords to the PDUs.

In this case, the recommended alternative is to create a neutral third-party that can supplement the quorum calculation.

In the event of a failure, a node needs to be able to see ether its peer or the arbitrator in order to recover services. The arbitrator also includes to act as a tie-breaker if both nodes can see the arbitrator but not each other.

This option needs to be paired with an indirect fencing method, such as a watchdog that is configured to panic the machine if it looses connection to its peer and the arbitrator. In this way, the survivor is able to assume with reasonable confidence that its peer will be in a safe state after the watchdog expiry interval.

The practical difference between an arbitrator and a third node is that the arbitrator has a much lower footprint and can act as a tie-breaker for more than one cluster.

Option 3 - More Human Than Human

The final approach is for survivors to continue hosting whatever services they were already running, but not start any new ones until either the problem resolves itself (network heals, node reboots) or a human takes on the responsibility of manually confirming that the other side is dead.

Bonus Option

Did I already mention you could add a third node? We test those a lot :-)

Two Racks

For the sake of argument, lets imagine I’ve convinced you the reader on the merits of a third node, we must now consider the physical arrangement of the nodes. If they are placed in (and obtain power from), the same rack, that too represents a SPoF and one that cannot be resolved by adding a second rack.

If this is surprising, consider what happens when the rack with two nodes fails and how the surviving node would differentiate between this case and a network failure.

The short answer is that it can’t and we’re back to having all the problems of the two-node case. Either the survivor:

ignores quorum and incorrectly tries to initiate recovery during network outages (whether fencing is able to complete is a different story and depends on whether PDU is involved and if they share power with any of the racks), or
respects quorum and unnecessarily shuts itself down when its peer fails

Either way, two racks is no better than one and the nodes must either be given independant supplies of power or be distributed accross three (or more depending on how many nodes you have) racks.

Two Datacenters

By this point the more risk averse readers might be thinking about disaster recovery. What happens when an asteroid hits the one datacenter with our three nodes distributed across three different racks? Obviously Bad Things(tm) but depending on your needs, adding a second datacenter might not be enough.

Done properly, a second datacenter gives you a (reasonably) up-to-date and consistent copy of your services and their data. However just like the two- node and two-rack scenarios, there is not enough information in the system to both maximise availability and prevent corruption (or diverging datasets). Even with three nodes (or racks), distributing them across only two datacenters leaves the system unable to reliably make the correct decision in the (now far more likely) event that the two sides cannot communicate.

Which is not to say that a two datacenters solution is never appropriate. It is not uncommon for companies to want a human in the loop before taking the extraordinary step of failing over to a backup datacenter. Just be aware that if you want automated failure, you’re either going to need a third datacenter in order for quorum to make sense (either directly or via an arbitrator) or find a way to reliably power fence an entire datacenter.

Footnotes

[1] Not everyone needs redundant power companies with independent transmission lines. Although the paranoia paid off for at least one customer when their monitoring detected a failing transformer. The customer was on the phone trying to warn the power company when it finally blew.

Containerizing Databases with Kubernetes and Stateful Sets

2017-02-15T12:45:00+11:00

The canonical example for Stateful Sets with a replicated application in Kubernetes is a database.

As someone looking at how to move foundational OpenStack services to containers, and eventually Kubernetes in the future, this is great news as databases are very typical of applications with complex boostrap and recovery processes.

If we can successfully show Kubernetes managing a multi-master database natively safely, the patterns would be broadly applicable and there is one less reason to have a traditional cluster manager in such contexts.

TL;DR

Kubernetes today is arguably unsuitable for deploying databases UNLESS the pod owner has the ability to verify the physical status of the underlying hardware and is prepared to perform manual recovery in some scenarios.

General Comments

The example allows for N slaves but limits itself to a single master.

Which is absolutely a valid deployment, but does prevent us from exploring some of the more interesting corner multi-master cases and unfortunately from a HA perspective makes pod 0 a single point of failure because:

although MySQL slaves can be easily promoted to masters, the containers do not expose such a mechanism, and even if they did
writers are told to connect to pod 0 explicitly rather than use the mysql-read service

So if the worker on which pod 0 is running hangs or becomes unreachable, you’re out of luck.

The loss of this worker currently puts Kubernetes in a no-win situation. Either it does the safe thing (the current behaviour) and prevents the pod from being recovered or the attached volume from being accessed, leading to more downtime (because it requires an admin to intervene) than a traditional HA solution. Or it allows the pod to be recovered, risking data corruption if the worker (and by inference, the pod) is not completely dead.

Ordered Bootstrap and Recovery

One of the more important capabilities of StatefulSets is that:

Pod N cannot be recovered, created or destroyed until all pods 0 to N-1 are active and healthy.

This allows container authors to make many simplifying assumptions during bootstrapping and scaling events (such as who has the most recent copy of the data at a given point).

Unfortunately, until we get pod safety and termination guarantees, it means that if a worker node crashes or becomes unreachable, it’s pods are unrecoverable and any auto-scaling policies cannot be enacted.

Additionally, the enforcement of this policy only happens at scheduling time.

This means that if there is a delay enacting the scheduler’s results, an image must be downloaded, or an init container is part of the scale up process, there is a significant period of time in which an existing pod may die before new replicas can be constructed.

As I type this, the current status on my testbed demonstrates this fragility:

# kubectl get pods
NAME                         READY     STATUS        RESTARTS   AGE
[...]
hostnames-3799501552-wjd65   0/1       Pending       0          4m
mysql-0                      2/2       Running       4          4d
mysql-2                      0/2       Init:0/2      0          19h
web-0                        0/1       Unknown       0          19h

As described, the feature suggests this state (mysql-2 to be in the init state while mysql-1 is not active) can never happen.

While such behaviour remains possible, container authors must take care to include logic to detect and handle such scenarios. The easiest course of action is to call exit and cause the container to be re-scheduled.

The example partially addresses this race condition by bootstrapping pod N from N-1. This limits the impact of pod N’s failure to pod N+1’s startup/recovery period.

It is easy to conceive of an extended solution that closed the window completely by trying pods N-1 to 0 in order until it found an active peer to sync from.

Extending the Pattern to Galera

All Galera peers are writable, which makes some aspects easier and others more complicated.

Bootstrapping pod 0 would require some logic to determine if it is bootstrapping the cluster (--wsrep="") or in recovery mode (--wsrep=all:6868,the:6868,peers:6868) but special handling of pod 0 has precedent and is not onerous. The remaining pods would unconditionally use --wsrep=all:6868,the:6868,peers:6868.

pod 0 is no longer a single point of failure with respect to writes, however the loss of the worker it is hosted on will continue to inhibit scaling events until manually confirmed and cleaned up by an admin.

Violations of the linear start/stop ordering could be significant if they result from a failure of pod 0 and occur while bootstrapping pod 1. Further, if pod 1 was stopped significantly earlier than pod 0, then depending on the implementation details of Galera, it is conceivable that a failure of pod 0 while pod 1 is synchronising might result in either data loss or pod 1 becoming out of sync.

Removing Shared Storage

One of the main reasons to choose a replicated database is that it doesn’t require shared storage.

Having mutiple slaves certainly assists read scalability and if we modified the example to use multiple masters it would likely improve write performance and failover times. However having multiple copies of the database on the same shared storage does not provide additional redundancy over what the storage already provides - and that is important to some customers.

While there are ways to give containers access to local storage, attempting to make use of them for a replicated database is problematic:

It is currently not possible to enforce that pods in a Stateful Set always run on the same node.

Kubernetes does have the ability to assign node affinity for pods, however since the Stateful Sets are a template, there is no opportunity to specify a different kubernetes.io/hostname selector for each copy.

As the example is written, this is particularly important for pod 0 as it is the only writer and the only one guaranteed to have the most up-to-date version of the data.

It is possible that to work-around this problem if the replica count exceeds the worker count and all peers were writable masters, however incorporating such logic into the pod would negate much of the benefit of using Stateful Sets.
A worker going offline prevents the pod from being started.

In the shared storage case, it was possible to manually verify the host was down, delete the pod and have Kubernetes restart it.

Without shared storage this is no longer possible for pod 0 as that worker is the only one with the data used to bootstrap the slaves.

The only options are to bring the worker back, or manually alter the node affinities to have pod 0 replace the slave on the worker with the most up-to-date one.

Summing Up

While Stateful Sets may not satisfy those looking for data redundancy, they are a welcome addition to Kubernetes that will require pod safety and termination guarantees before they can really shine. The example gives us a glimpse of the future but arguably shouldn’t be used in production yet.

Those looking to manage a database with Kubernetes today would be advised to use individual pods and/or vanilla ReplicaSets, need the ability to verify the physical status of the underlying hardware and should be prepared to perform manual recovery in some scenarios.

HA for Composible Deployments of OpenStack

2016-07-24T13:20:00+10:00

One of the hot topics for OpenStack deployments is composable roles

the ability to mix-and-match which services live on which nodes.

This is mostly a solved problem for services not managed by the cluster, but what of the services still managed by the cluster?

Considerations

Scale up

Naturally we want to be able to add more capacity easily
Scale down

And have the option to take it away again if it is no longer necessary
Role re-assignment post-deployment

Ideally the task of taking capacity from one service and giving it to another would be a core capability and not require a node be nuked from orbit first.
Flexible role assignments

Ideally, the architecture would not impose limitations on how roles are assigned.

By allowing roles to be assigned on an ad-hoc basis, we can allow arrangements that avoid single-points-of-failure (SPoF) and potentially take better advantage of the available hardware. For example:
- node 1: galera and rabbit
- node 2: galera and mongodb
- node 3: rabbit and mongodb
This also has implications when just one of the roles needs to be scaled up (or down). If roles become inextricably linked at install time, this requires every service in the group to scale identically - potentially resulting in higher hardware costs when there are services that cannot do so and must be separated.

Instead, even if two services (lets say galera and rabbit) are originally assigned to the same set of nodes, this should imply nothing about how either of them can or should be scaled in the future.

We want the ability to deploy a new rabbit server without requiring it host galera too.

Scope

This need only apply to non-OpenStack services, however it could be extended to those as well if you were unconvinced by my other recent proposal.

At Red Hat, the list of services affected would be:

HAProxy
Any VIPs
Galera
Redis
Mongo DB
Rabbit MQ
Memcached
openstack-cinder-volume

Additionally, if the deployment has been configured to provide Highly Available Instances:

nova-compute-wait
nova-compute
nova-evacuate
fence-nova

Proposed Solution

In essance, I propose that there be a single native cluster, consisting of between 3 (the minimum sane cluster size) and 16 (roughly Corosync’s out-of-the-box limit) nodes, augmented by a collection of zero-or-more remote nodes.

Both native and remote nodes will have roles assigned to them, allowing Pacemaker to automagically move resources to the right location based on the roles.

Note that all nodes, both native and remote, can have zero-or-more roles and it is also possible to have a mixture of native and remote nodes assigned to the same role.

This will allow us, by changing a few flags (and potentially adding extra remote nodes to the cluster) go from a fully collapsed deployment to a fully segregated one - and not only at install time.

If installers wish to support it¹, this architecture can cope with roles being split out (or recombined) after deployment and of course the cluster wont need to be taken down and resources will move as appropriate.

Although there is no hard requirement that anything except the fencing devices run on the native nodes, best practice would arguably dictate that HAProxy and the VIPs be located there unless an external load balancer is in use.

The purpose of this would be to limit the impact of a hypothetical pacemaker-remote bug. Should such a bug exist, by virtue of being the gateway to all the other APIs, HAProxy and the VIPs are the elements one would least want to be affected.

Some installers may even choose to enforce this in the configuration, but “by convention” is probably sufficient.

Implementation Details

The key to this implementation is Pacemaker’s concept of node attributes and expressions that make use of them.

Instance attributes can be created with commands of the form:

pcs property set --node controller-0 proxy-role=true

Note that this differs from the osprole=compute/controller scheme used in the Highly Available Instances instructions. That arrangement wouldn’t work here as each node may have serveral roles assigned to it.

Under the covers, the result in Pacemaker’s configuration would look something like:

<cib ...>
  <configuration>
    <nodes>
      <node id="1" uname="controller-0">
        <instance_attributes id="controller-0-attributes">
          <nvpair id="controller-0-proxy-role" name="proxy-role" value="true"/>
...

These attributes can then be referenced in location constraints that restrict the resource to a subset of the available nodes based on certain criteria

For example, we would use the following for HAProxy:

pcs constraint location haproxy-clone rule score=0 proxy-role eq true

which would create the following under the covers:

<rsc_location id="location-haproxy" rsc="haproxy-clone">
  <rule id="location-haproxy-rule" score="0">
    <expression id="location-haproxy-rule-expr" attribute="proxy-role" operation="eq" value="true"/>
  </rule>
</rsc_location>

Any node, native or remote, not meeting the criteria is automatically eliminated as a possible host for the service.

Pacemaker also defines some node attributes automatically based on a node’s name and type. These are also available for use in constraints. This allows us, for example, to force a resource such as nova-evacuate to run on a “real” cluster node with the command:

pcs constraint location nova-evacuate rule score=0 "#kind" ne remote

For deployments based on Pacemaker 1.1.15 or later, we can also simplify the configuration by using pattern matching in our constraints.

Restricting all the VIPs to nodes with the proxy role:

 <rsc_location id="location-haproxy-ips" resource-discovery="exclusive" rsc-pattern="^(ip-.*)"/>

Restricting nova-compute to compute nodes (assuming a standardized naming convention is used):

 <rsc_location id="location-nova-compute-clone" resource-discovery="exclusive" rsc-pattern="nova-compute-(.*)"/>

Final Result

This is what a fully active cluster would look like:

9 nodes configured
87 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
RemoteOnline: [ overcloud-compute-0 overcloud-compute-1
overcloud-compute-2 rabbitmq-extra-0 storage-0 storage-1 ]

 ip-172.16.3.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-192.0.2.17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-172.16.2.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.16.2.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.1.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 ip-192.0.3.30 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Slaves: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 rabbitmq-extra-0 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started storage-0
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: nova-compute-wait-clone [nova-compute-wait]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
 fence-nova (stonith:fence_compute): Started overcloud-controller-0
 storage-0 (ocf::pacemaker:remote): Started overcloud-controller-1
 storage-1 (ocf::pacemaker:remote): Started overcloud-controller-2
 overcloud-compute-0 (ocf::pacemaker:remote): Started overcloud-controller-0
 overcloud-compute-1 (ocf::pacemaker:remote): Started overcloud-controller-1
 overcloud-compute-2 (ocf::pacemaker:remote): Started overcloud-controller-2
 rabbitmq-extra-0 (ocf::pacemaker:remote): Started overcloud-controller-0

A small wish, but it would be nice if installers used meaningful names for the VIPs instead the underlying IP addresses they manage.

One reason they may not do so on day one, is the careful co-ordination that some services can require when there is no overlap between the old and new sets of nodes assigned to a given role. Galera is one specific case that comes to mind. ↩

Thoughts on HA for Multi-Subnet Deployments of OpenStack

2016-07-24T13:20:00+10:00

In a normal deployment, in order to direct traffic to the same HAProxy instance, Pacemaker will ensure that each VIP is configured on at most one HAProxy machine.

However in a spine and leaf network architecture, the nodes are in multiple subnets and there may be a limitation that the machines can not be part of a common L3 network that the VIPs could be added to.

Once the traffic reaches HAProxy everything should JustWork(tm) - modulo creating the appropriate networking rules, the problem is getting it to the proxy.

The approach to dealing with this will need to differ based on the latencies that can be gaurenteed between every nodes in the cluster. At Red Hat, we define LAN-like latencies to be 2ms or better - consistently and between every node that would make up the cluster.

Low Latency Links

You have more flexibility in low latency scenarios as the cluster software can operate as designed.

At a high level, the possible ways forward are:

Decide the benefit isn’t worth it and create an L3 network just for the VIPs to live on.
Put all the controllers into a single subnet.

Just be mindful of what will happen if the switch connecting them goes bad.
Replace the HAProxy/VIP portion of the architecture with a load balancer appliance that is accessible from the child subnets.
Move the HAProxy/VIP portion of the architecture into a dedicated 3-node cluster load balancer that is accessible from the child subnets.

The new cluster would need the list of controllers and some health checks which could be as simple as “is the controller up/down” or as complex as “is service X up/down”.

Right now creating a load balancer near the network spine would have to be an extra manual step for users of TripleO. However once composable roles (the ability to mix and match which services go on which nodes) are supported, it should be possible to install such a configuration out of the box by placing three machines near the spine and giving only them the “haproxy” role.

Higher Latency

Corosync has very strict latency requirements of no more than 2ms for any of its links. Assuming your installer can deploy across subnets, the existance of such a link would be a barrier to the creation of a highly available dpeloyment.

To work around these requirements, we can use Pacemaker Remote to extend Pacemaker’s ability to manage services on nodes separated by higher latency links.

In TripleO, the work needed to make this style of deployment possible is already planned as part of our “HA for composable roles” design.

As per option 4 of the low latency case, such a deployment would consist of a three node cluster containing only HAProxy and some floating IPs.

The rest of the nodes that make up the traditional OpenStack control-plane are managed as remote cluster nodes. Meaning instead of a traditional Corosync and Pacemaker stack, they have only the pacemaker-remote daemon and do not participate in leader elections or quorum calculations.

External Load Balancers

If you you wish to use a dedicated load balancer, then the 3-node cluster would just co-ordinate the actions of the remote nodes and not host any services locally.

An installer may conceivably create them anyway but leave them disabled to simplify the testing matrix.

General Considerations

The creation of a separate subnet or set of subnets for fencing is highly encouraged.

In general we want to avoid the possibility for a single network(ing) failure that can take out communication to the both a set of nodes and the device that can turn them off.

Everything is HA a trade-off between the chance of a particular failure occurring and the consequences if it ever actually happens. Everyone will likely draw the line in a different place based on their risk aversion, all I can do is make recommendations based on my background in this field.

Working with OpenStack Images

2016-06-20T12:45:00+10:00

Creating Images

For creating images, I recommend the virt-builder tool that ships with RHEL based distributions and possibly others:

virt-builder centos-7.2 --format qcow2 --install "cloud-init" --selinux-relabel

Note the use of the --selinux-relabel option. If you specify --install but do not include this option, you may end up with instances that treats all attempts to log in as security violations and blocks them.

The cloud-init package is incredibly useful (discussed later) but isn’t available in CentOS images by default, so I recommend adding it to any image you create.

For the full list of supported targets, try virt-builder -l. Targets should include CirrOS as well as several versions of openSUSE, Fedora, CentOS, Debian, and Ubuntu.

Adding Packages to an existing Image

On RHEL based distributions, the virt-customize tool is available and makes adding a new package to an existing image simple.

virt-customize  -v -a myImage --install "wget,ntp" --selinux-relabel

Note once again the use of the --selinux-relabel option. This should only be used for the last step of your customization. As above, not doing so may result in an instance that treats all attempts to log in as security violations and blocks them.

Richard Jones also has a good post about updating RHEL images since they require subscriptions. Just be sure to use --sm-unregister and --selinux-relabel at the very end.

Logging in

If you haven’t already, tell OpenStack about your keypair:

nova keypair-add myKey --pub-key ~/.ssh/id_rsa.pub

Now you can tell your provisioning tool to add it to the instances it creates. For Heat, the template would look like this:

myInstance:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    key_name: myKey

However almost no image will let you log in, via ssh or on the console, as root. Instead they normally create a new user that has full sudo access. Red Hat images default to cloud-user while CentOS has a centos user.

If you don’t already know which user your instance has, you can use nova console-log myImage to see what happens at boot time.

Assuming you configured a key to add to the instance, you might see a line such as:

ci-info: ++++++Authorized keys from /home/cloud-user/.ssh/authorized_keys for user cloud-user+++++++

which tells you which user your image supports.

Customizing an Instance at Boot Time

This section relies heavily on the cloud-init package. If it is not present in your images, be sure to add it using the techniques above before trying anything below.

Running Scripts

Running scripts on the instances once they’re up can be a useful way to customize your images, start services and generally work-around bugs in officially provided images.

The list of commands to run is specified as part of the user_data section of a Heat template or can be passed to nova boot with the --user-data option:

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      #!/bin/sh -ex

      # Fix broken qemu/strstr()
      # https://bugzilla.redhat.com/show_bug.cgi?id=1269529#c9
      touch /etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned

Note the extra options passed to /bin/sh The e tells the script to terminate if any command produces an error and the x tells the shell to log everything that is being executed. This is particularly useful as it causes the script’s execution to be available in the console’s log (nova console-log myServer).

When Scripts Take a Really Long Time

If we have scripts that take a really long time, we may want to delay the creation of subsequent resources until our instance is fully configured.

If we are using Heat, we can set this up by creating SwiftSignal and SwiftSignalHandle resources to coordinate resource creation with notifications/signals that could be coming from sources external or internal to the stack.

signal_handle:
  type: OS::Heat::SwiftSignalHandle

wait_on_server:
  type: OS::Heat::SwiftSignal
  properties:
    handle: {get_resource: signal_handle}
    count: 1
    timeout: 2000

We then add a layer of indirection to the user_data: portion of the instance definition using the str_replace: function to replace all occurences of “wc_notify” in the script with an appropriate curl PUT request using the “curl_cli” attribute of the SwiftSignalHandle resource.

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      str_replace:
        params:
          wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
        template: |
          #!/bin/sh -ex

          my_command_that --takes-a-really long-time

          wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Now the creation of myNode will only be considered successful if and when the script completes.

Installing Packages

One should avoid the temptation to hardcode calls to a specific package manager as part of a script as it limits the usefulness of your template. Instead, this is done in a platform agnostic way using the packages directive.

Note that instance creation will not fail if packages fail to install or are already present. Check for any required binaries or files as part of the script.

user_data_format: RAW
user_data:
  #cloud-config
  # See http://cloudinit.readthedocs.io/en/latest/topics/examples.html
  packages:
    - ntp
    - wget

Note that this will NOT work for images that need a Red Hat subscription. There is supposed to be a way to have it register, however I’ve had no success with this method and instead I recommend you create a new image that has any packages listed here pre-installed.

Installing Packages and Running scripts

The first line of the user_data: section (#config or #!/bin/sh) is used to determine how it should be interpreted. So if we wish to take advantage of scripting and cloud-init, we must combine the two pieces into a multi-part MIME message.

The cloud-init docs include a MIME helper script to assist in the creation of complex user_data: blocks.

Simply create a file for each section and invoke with a command line similar to:

python ./mime.py cloud.config:text/cloud-config cloud.sh:text/x-shellscript

The resulting output can then be pasted in as a template and even edited in-place later. Here is an example that includes notification for a long running process:

user_data_format: RAW
user_data:
  str_replace:
    params:
      wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
    template: |
      Content-Type: multipart/mixed; boundary="===============3343034662225461311=="
      MIME-Version: 1.0
      
      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/cloud-config; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.config"

      #cloud-config
      packages:
        - ntp
        - wget

      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/x-shellscript; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.sh"
      
      #!/bin/sh -ex

      my_command_that --takes-a-really long-time

      wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Evolving the OpenStack HA Architecture

2016-06-07T14:06:00+10:00

In the current OpenStack HA architecture used by Red Hat, SuSE and others, Systemd is the entity in charge of starting and stopping most OpenStack services. Pacemaker exists as a layer on top, signalling when this should happen, but Systemd is the part making it happen.

This is a valuable contribution for active/passive (A/P) services and those that require all their dependancies be available during their startup and shutdown sequences. However as OpenStack matures, more and more components are able to operate in an unconstrained active/active capacity with little regard for the startup/shutdown order of their peers or dependancies - making them well suited to be managed by Systemd.

For this reason, a future revision of the HA architecture should limit Pacemaker’s involvement to core services like Galera and Rabbit as well as the few remaining OpenStack services that run A/P.

This would be particularly useful as we look towards a containerised future. It both allows OpenStack to play nicely with the current generation of container managers which lack Orchestration, as well as reduces recovery and downtime by allowing for the maximum parallelism.

Divesting most OpenStack services from the cluster also removes Pacemaker as a potential obstacle for moving them to WSGI. It is as-yet unclear if services will live under a single Apache instance or many and the former would conflict with Pacemaker’s model of starting, stopping and monitoring services as individual components.

Objection 1 - Pacemaker as an Alerting Mechanism

Using Pacemaker as an alerting mechanism for a large software stack is of limited value. Of course Pacemaker needs to know when a service dies but it necessarily takes action straight away, not wait around to see if there will be any others with which it can correlate a root cause.

In large complex software stacks, the recovery and alerting components should not be the same thing because they do and should operate on different timescales.

Pacemaker also has no way to include the context of a failure in an alert and thus no way to report the difference between Nova failing and Nova failing because Keystone is dead. Indeed Keystone being the root cause could be easily lost in a deluge of notifications about the failure of services that depend on it.

For this reason, as the number of services and dependancies grow, Pacemaker makes a poor substitute for a well configured monitoring and alerting system (such as Nagios or Sensu) that can also integrate hardware and network metrics.

Objection 2 - Pacemaker has better Monitoring

Pacemaker’s native ability to monitor services is more flexible than Systemd’s which relies on a “PID up == service healthy” mode of thinking ¹.

However, just as Systemd is the entity performing the startup and shutdown of most OpenStack services, it is also the one performing the actual service health checks.

To actually take advantage of Pacemaker’s monitoring capabilities, you would need to write Open Cluster Framework (OCF) agents ² for every OpenStack service. While this would not take a rocket scientist to achieve, it is an opportunity for the way services are started in a clustered and non-clustered environment to diverge.

So while it may feel good to look at a cluster and see that Pacemaker is configured to check the health of a service every N seconds, all that really achieves is to sync Pacemaker’s understanding of the service with what Systemd already knew. In practice, on average, this ends up delaying recovery by N/2 seconds instead of making it faster.

Bonus Round - Active/Passive FTW

Some people have the impression that A/P is a better or simpler mode of operation for services and in this was justify the continued use of Pacemaker to manage OpenStack services.

Support for A/P configurations is important, it allows us to make applications that are in no way cluster-aware more available by reducing the requirements on the application to almost zero.

However, the downside is slower recovery as the service must be bootstrapped on the passive node, which implies increased downtime. So at the point the service becomes smart enough to run in an unconstrained A/A configuration, you are better off to do so - with or without a cluster manager.

Watchdog-like functionality is only a variation on this, it only tells you that the thread responsible for heartbeating to Systemd is alive and well - not if the APIs it exposes are functioning. ↩
Think SYS-V init scripts with some extra capabilities and requirements particular to clustered/automated environment. It’s a standard historically supported by the Linux Foundation but hasn’t caught on much since it was created in the late 90’s. ↩

Minimum Viable Cluster

2015-10-07T16:51:00+11:00

In the past there was a clear distinction between high performance (HP) clustering and high availability (HA) clustering, however the lines have been bluring for some time. People have scaled HA clusters upwards and HP inspired clusters have been used to provide availability through redundancy.

The trend in providing availablity of late has been towards the HP model - pools of anonymous and stateless workers that can be replaced at will. A really attractive idea but in order to pull it off they have to make assumptions that may or may not be compatible with some peoples’ workloads.

Assumptions are neither wrong nor bad, you just need to make sure they are compatible with your environment.

So when looking for an availablity solution, keep Occam’s razor in mind but don’t be a slave to it. Look for the simplest architecture but then work upwards until you find one that meets the needs of your actual (not ideal) application or stack.

Starting Simple

Application HA is the simplest kind of cluster you can deploy because the cluster and the application are the same thing. It takes care of talking to its peers, checking to see if they’re still online, deciding if it should remain operational (because too many peers were lost) and synchronising any data between itself and peers.

This gives you basic fault tolerance, when a node fails there are other copies with sufficient state to take up the workload.

Galera and RabbitMQ (with replicated queues) are two popular examples in this category.

However when I said Application HA was the simplest, thats only from an admin’s point of view, because the application is doing everything.

Some issues the creators of these kinds of applications think about ahead of time:

Can I assume a node that I can’t see is offline?
What to do when some of the nodes cannot see other ones? (quorum)
What to do when half the nodes cannot see the other half? (split-brain)
Does it matter if the application is still active on nodes we cannot see? (data integrity)
Is there state that needs to synchronised? (replication)
If so, how to do so reliably and in the presence of past and future failures? (reconciliation)

So if you’re looking to create a custom application with similar properties, make sure you can fund the development team you will need to make it happen.

And remember that the reality of those simplifying assumptions will only be apparent after everything else has already hit the fan.

But lets assume the best-case here… if all you need is one of these existing applications, great! Install, configure, done. Right?

Maybe. It might depend on your hardware budget.

Unfortunately (or perhaps not) most companies aren’t Google, Twitter or Bookface. Most companies do not have thousands of nodes in their cluster, in fact getting many of them to have more than two can be a struggle.

In such environments the overhead of having 1?, 2?, 10?!? spare nodes (just in case of a failure - which will surely never happen) starts to represent a significant portion of their balance sheet.

As the number of spare nodes goes down, so does the number of failures that the system can absorb. It is irrelevant if a failure leaves two (or twenty) functional nodes if the load generated by the clients exceeds the application’s ability to keep up.

An overloaded system leads to operation timeouts which generates even more load and more timeouts. The surviving nodes aren’t really functional at that point either.

If the services lived in a widget of some kind (perhaps Docker containers or KVM images), we could have a higher level entity that would make new copies for us. Problem solved right?

Maybe. Is your application truely stateless?

Some are, Memcache is one that comes to mind because its a cache, neither creating nor consuming anything. However even web servers seem to want session state these days, so chances are your application isn’t stateless either.

Stateless is hard.

Where do new instances recover their state from? Who are its peers? A static list isn’t going to be possible if the widgets are anonymous cattle. Do you need a discovery protocol in your application?

There may also be a penalty for bringing up a brand new instance. For example, the sync time for a new Galera instance is a function of the dataset size and network bandwidth. That can easily run into the tens-of-minutes range.

So there is an incentive to either stop modelling everything as cattle or to keep the state somewhere else.

Ok, so lets put all the state in a database. Problem solved right?

You could create a single widget with both the application and the database. That would allow you to use systemd to achieve Node HA

the automated and deterministic management of a set of resources within a single node.

In some ways, systemd looks a lot like a cluster manager. It knows about sets of services, it knows relationships between them (so that the database is started before the application) and it knows how to recover the stack if something fails.

Unfortunately you’re out of luck if a failure on node A requires recovery (of the same or a different service) on nodeB because the caveat is that all these relationships must exist within a single node.

This of course is not the container model - which likes to have each service in its own widget. More importantly, you always need to pay the database synchronisation cost for every failure which is not ideal.

Alternatively, if your application isn’t active-active, you don’t even get the option of combining them into a single flavour of widget.

By splitting them up into two however, the synchronisation cost is only payable when a database widget dies. This improves your recovery time and makes the widget purists happy, but now you make need to make sure the application doesn’t start until the database is both running, synchronised and available to take requests.

About now you might be tempted to think about putting retry loops in the application instead.

Chances are however, there is another service that is a client of the application (and there is a client of the client of the …).

Every time you build in another level of retry loops, you increase your failure detection time and ultimately your downtime.

Hence the question: How smart is your widget manager?

It needs to ensure there are at least N copies of a widget active.
It might need to ensure there are less than M copies available.
It might need to ensure the application starts after the database.
It might need to be able to stop the application if not enough copies of the database are around and/or writable. Perhaps it got corrupted? Perhaps someone needs to do maintenance?

Lets assume the widget manager can do these things. Most can, that means we’re done right?

Just because the widget manager cannot see one of its peers with a bunch of application widgets, does not mean they’re not happily swallowing client requests they can never process and/or writing to the data store via some other subnet.

If this does not apply to your application, consider yourself blessed.

For the rest of us, in order to preserve data integrity, we need the widget manager to take steps to ensure that the peer it can no longer see does not have any active widgets.

This is one reason why systemd is rarely sufficient on its own.

Hint: A great way to do this is to power off the host

Are you done yet?

One thing we skipped over is where the database itself is storing its state.

If you were using bare metal, you could store it there - but thats old-fashioned. Storing it in the KVM image or docker container isn’t a good idea, you’d loose everything if the last container ever died.

Projects like glusterfs are options, just be sure you understand what happens when partitions form.

If you’re thinking of something like NFS or iSCSI, consider where those would come from. Almost certainly you don’t want a single node serving them up - that would introduce a single point of failure and the whole point of this is to remove those.

You could add a SAN into the mix for some hardware redundancy, however you need to ensure either:

exactly one node accesses the SAN at any time (active/passive), or
your filesystem can handle concurrent reads and write from multiple hosts (active/active)

Both options will require quorum and fencing in order to reliably hand out locks. This is the sweet-spot of a full blown cluster manager, System HA, and why traditional, scary, cluster managers like Pacemaker and Veritas exist.

Unless you’d like to manually resolve block-level conflicts after a split-brain, some part of the system needs to rigorously enforce these things. Otherwise its widget managers all the way down.

One of Us

Once you have a traditional cluster manager, you might be surprised how useful it can be.

A lot of applications are resilient to failures once they’re up, but have non-trivial startup sequences. Consider RabbitMQ:

Pick one active node and start rabbitmq-server
Everywhere else, run
- Start rabbitmq-server
- rabbitmqctl stop_app
- rabbitmqctl join_cluster rabbit@${firstnode}
- rabbitmqctl start_app

Now the Rabbit’s built-in HA can take over but to get to that point:

How do you pick which is the first node?
How do you tell everyone else who it is?
Can rabbitmq accept updates before all peers have joined?
Can your app?

This is the sort of thing traditional cluster managers do before breakfast. They are afterall really just distributed finite state machines.

Recovery can be a troublesome time too:

http://previous.rabbitmq.com/v3_3_x/clustering.html

the last node to go down must be the first node to be brought online. If this doesn’t happen, the nodes will wait 30 seconds for the last discconected node to come back online, and fail afterwards.

Depending on how the nodes were started, you may see some nodes running and some stopped. What happens if the last node isn’t online yet?

Some cluster managers support concepts like dual-phased services (or “Master/slave” to use the politically incorrect term) that can allow automated recovery even with constraints such as these. We have Galera agents that also take advantage of these capabilities - finding the ‘right’ node to bootstrap before synchronising it to all the peers.

Final thoughts

HA is a spectrum, where you fit depends on what assumptions you can make about your application stack.

Just don’t make those assumptions before you really understand the problem at hand, because retrofitting an application to remove simplifying assumptions (such as only supporting pets) is even harder than designing it in in the first place.

Whats your minimal viable cluster?

That Cluster Guy

Savaged by Softdog, a Cautionary Tale

A New Fencing Mechanism (TBD)

Protecting Database Centric Applications

A New Type of Death

Limitations

Footnotes

A New Thing

Two Nodes - The Devil is in the Details

Fencing

Direct Fencing

Indirect Fencing

Quorum

Making Two Nodes Work

Option 1 - Add a Backup Fencing Method

Option 2 - Add an Arbitrator

Option 3 - More Human Than Human

Bonus Option

Two Racks

Two Datacenters

Footnotes

Containerizing Databases with Kubernetes and Stateful Sets

TL;DR

General Comments

Ordered Bootstrap and Recovery

Extending the Pattern to Galera

Removing Shared Storage

Summing Up

HA for Composible Deployments of OpenStack

Considerations

Scope

Proposed Solution

Implementation Details

Final Result

Thoughts on HA for Multi-Subnet Deployments of OpenStack

Low Latency Links

Higher Latency

External Load Balancers

General Considerations

Working with OpenStack Images

Creating Images

Adding Packages to an existing Image

Logging in

Customizing an Instance at Boot Time

Running Scripts

When Scripts Take a Really Long Time

Installing Packages

Installing Packages and Running scripts

Evolving the OpenStack HA Architecture

Objection 1 - Pacemaker as an Alerting Mechanism

Objection 2 - Pacemaker has better Monitoring

Bonus Round - Active/Passive FTW

Minimum Viable Cluster

Starting Simple

Maybe. It might depend on your hardware budget.

Maybe. Is your application truely stateless?

Maybe. How smart is your widget manager?

Maybe. What happens if the widget manager cannot see one of its peers?

Are you done yet?

One of Us

Final thoughts