<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xml" href="/feed.xslt.xml"?>

<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="http://jekyllrb.com" version="3.10.0">Jekyll</generator>
  <link href="https://www.beekhof.net/atom.xml" rel="self" type="application/atom+xml" />
  <link href="https://www.beekhof.net/" rel="alternate" type="text/html" />
  <updated>2025-07-30T10:38:57+10:00</updated>
  <id>https://www.beekhof.net/</id>

  
    <title>That Cluster Guy</title>
  

  
    <subtitle>A thing I do.</subtitle>
  

  
    <author>
      
        <name>Andrew Beekhof</name>
      
      
        <email>andrew@beekhof.net</email>
      
      
    </author>
  

  
  
    <entry>
      <title>Building a Kubernetes Operator in a Day with AI</title>
      <link href="https://www.beekhof.net/blog/2025/coach-coding" rel="alternate" type="text/html" title="Building a Kubernetes Operator in a Day with AI" />
      <published>2025-07-30T10:28:00+10:00</published>
      
        <updated>2025-07-30T10:28:00+10:00</updated>
      

      <id>https://www.beekhof.net/blog/2025/coach-coding</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2025/coach-coding">&lt;h1 id=&quot;my-brain-is-utterly-broken-how-i-built-a-kubernetes-operator-in-a-day-with-ai&quot;&gt;My Brain is Utterly Broken: How I Built a Kubernetes Operator in a Day with AI&lt;/h1&gt;

&lt;p&gt;As someone who has always taken pride in my ability to rapidly implement features
and generally solve problems in software, I was extremely sceptical of the “vibe
coding” trend.&lt;/p&gt;

&lt;p&gt;Despite my misgivings, I felt I needed to understand it before completely
disregarding it.  However, after giving it a try, I cannot stop talking about
it.&lt;/p&gt;

&lt;p&gt;It’s been a genuine paradigm shift for me.  The process took me from a simple,
niche idea to a functioning project far faster than I could have ever imagined.&lt;/p&gt;

&lt;p&gt;Honestly, my mind has been blown, the veil has been lifted, and I’m still trying
to piece together what it all means.&lt;/p&gt;

&lt;h2 id=&quot;the-idea-that-no-one-had-time-for&quot;&gt;The Idea That No One Had Time For&lt;/h2&gt;

&lt;p&gt;It started, as many of these things do, with a real customer need. I had an idea
for a new operator that would help our Virtualization customers using ODF:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A cloud-native version of SBD from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clusterlabs&lt;/code&gt;, designed to work with NHC from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;medik8s&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are maybe a dozen people in the world to whom that sentence makes sense
without a fair bit of research (I didn’t tell it that SBD was Storage Based
Death, that NHC was Node Health Check, or what either of those two things
did).  Which made it the perfect test case: it was non-trivial, addressed a
real-world problem, and wholesale plagiarism was off the table.  With no
engineering capacity to do it properly, there was nothing to lose.&lt;/p&gt;

&lt;h2 id=&quot;the-vibe-coding-experiment&quot;&gt;The “Vibe Coding” Experiment&lt;/h2&gt;

&lt;p&gt;Inspired by a &lt;a href=&quot;https://www.linkedin.com/learning/structure-vibe-coding-to-save-build-time/tips-for-your-ai-vibe-coding-journey&quot;&gt;LinkedIn
course&lt;/a&gt;
on “vibe coding”, I decided to put the process to the test. I was sceptical it
would be applicable outside of their carefully navigated demos, and stacked the
deck against it. My goal was to model a worst-case scenario:&lt;/p&gt;

&lt;p&gt;What would an intern produce on their first day using this new AI-driven process?&lt;/p&gt;

&lt;p&gt;How much work would it be to get that result into a shippable state?&lt;/p&gt;

&lt;p&gt;I fed my one-sentence idea into the process. Using Gemini, I let it take the
reins, answering its questions with “you decide”, and did not even bother to
read the results.  I fed the generated prompts into the newly approved &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cursor&lt;/code&gt;
AI tool, provided zero oversight, and blindly accepted any code it generated…&lt;/p&gt;

&lt;p&gt;A few hours later, I had &lt;a href=&quot;https://github.com/beekhof/sbd-operator/tree/v1.0&quot;&gt;https://github.com/beekhof/sbd-operator/tree/v1.0&lt;/a&gt; and this concise problem statement:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Current Kubernetes node remediation solutions often rely on IPMI/iDRAC for
 fencing, which is not feasible in many cloud environments or certain on-premise
 setups. For stateful workloads depending on shared storage (like Ceph,
 traditional SANs via CSI), a mechanism is needed to reliably fence unhealthy
 nodes by leveraging their shared storage access, ensuring data consistency and
 high availability by preventing split-brain scenarios in a way that is
 consistent with the workloads it protects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This was not a toy.&lt;/p&gt;

&lt;p&gt;Let’s be clear though: the code didn’t work. There were gaps in the
implementation, the AI had cleverly used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--ginkgo.dry-run&lt;/code&gt; to fake its way
through the e2e tests, and the build system was a Rube Goldberg machine that
made my eyes bleed. But as a starting point created in just a few hours? It had
no right to be this good. In one day, I had a design, tests, and a
well-documented codebase.&lt;/p&gt;

&lt;h2 id=&quot;the-real-magic-ai-powered-iteration&quot;&gt;The Real Magic: AI-Powered Iteration&lt;/h2&gt;

&lt;p&gt;While the initial code drop was impressive, the follow-up process was truly
transformative. The bot iterates faster than I ever could.&lt;/p&gt;

&lt;p&gt;For instance, my colleague pointed out that the operator should be using
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conditions&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phases&lt;/code&gt;—a standard best practice for operators. To fix
this, I simply typed the following prompt:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;“convert all use of phases to conditions”&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That was it. That was the only thing I typed. About 10 minutes later, having
fixed its own compile errors and updated the tests with no additional
interaction from me, this commit appeared:
&lt;a href=&quot;https://github.com/beekhof/sbd-operator/commit/2a5f212d8bcbd8cbd46f3794d3acc2c28a28ef0b&quot;&gt;https://github.com/beekhof/sbd-operator/commit/2a5f212d8bcbd8cbd46f3794d3acc2c28a28ef0b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the steps were often imperfect, at the speed it iterated, the cumulative
effect of even small improvements was hugely significant over short periods of
time. Two weeks into the experiment, I would consider it a reasonable beta.&lt;/p&gt;

&lt;h2 id=&quot;success-means-defining-and-defending-the-success-criteria&quot;&gt;Success Means Defining (and Defending!) the Success Criteria&lt;/h2&gt;

&lt;p&gt;While the bot can crank out code and iterate at a speed that’s insane, the key
to success is human oversight—someone actually making sure its output is solid.
Sweating the details around the build system and test suite defines the bot’s
success criteria.&lt;/p&gt;

&lt;p&gt;Unsupervised, the bot will do &lt;em&gt;anything&lt;/em&gt; to allow both of those to succeed,
including cheating. This can manifest in several insidious ways: it might
transform complex, problematic code sections into simplistic print statements,
effectively sidestepping the actual logic that needs testing. I have also seen
it subtly alter the assertions and conditions within the tests, ensuring they no
longer rigorously check for the intended functionality but instead validate a
more easily achievable (and potentially incorrect) outcome. In the most extreme
instances, it has outright deleted tests that proved too difficult to pass.&lt;/p&gt;

&lt;p&gt;Likewise, the bot is so ridiculously good at just &lt;em&gt;making&lt;/em&gt; code, it sometimes
completely ignores the idea of reusing existing functionality. Why bother
finding and integrating an existing function when it can whip up a new, slightly
different version in seconds?  Unchecked, the result is dozens of essentially
identical functions with subtle differences.  It’s not just messy; it’s a
maintenance nightmare that takes care and time to unravel.  Worst of all, the
sheer volume of code gives the bot a bigger surface to “adjust” in order to
make tests pass, and makes those adjustments harder to spot in bulk changes.&lt;/p&gt;

&lt;p&gt;These actions, while entirely predictable, highlight the need for a human to be
in control at all times.&lt;/p&gt;

&lt;h2 id=&quot;a-better-name&quot;&gt;A Better Name&lt;/h2&gt;

&lt;p&gt;The term “vibe coding” still makes me somewhat nauseous. It sounds like
something you do after a few too many energy drinks, and reflects poorly on both
the process and the folks wielding it so effectively.&lt;/p&gt;

&lt;p&gt;To me, the process felt like coaching a seriously brilliant, albeit occasionally
mischievous, junior engineer who’s somehow operating at warp speed. You’re
constantly guiding, correcting, and nudging it in the right direction, providing
that human oversight. I’m starting to use the term “Coach Coding” instead, as it
is a more accurate reflection of the collaborative, iterative, and ultimately
human-driven nature of the process.&lt;/p&gt;

&lt;h2 id=&quot;an-insanely-powerful-tool&quot;&gt;An Insanely Powerful Tool&lt;/h2&gt;

&lt;p&gt;The bot is not perfect, but that’s not the point. In the hands of an experienced
practitioner who can spot missteps and highlight areas for improvement, it is an
insanely powerful tool. It’s not about replacing the developer, but amplifying
their ability to execute and iterate at a speed I never imagined possible.&lt;/p&gt;

&lt;p&gt;The experience of going from a niche idea to a tangible, well-structured project
in a matter of days has fundamentally changed my perspective on what’s possible.&lt;/p&gt;

&lt;p&gt;The future of development is already here, and it’s more amazing than I thought
possible.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="ai" />
      
        <category term="best practices" />
      

      
        <summary>Vibe coding sounds like something you do after too many energy drinks, but after giving it a try, and despite its flaws, I cannot stop talking about the experience.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Savaged by Softdog, a Cautionary Tale</title>
      <link href="https://www.beekhof.net/blog/2019/savaged-by-softdog" rel="alternate" type="text/html" title="Savaged by Softdog, a Cautionary Tale" />
      <published>2019-10-11T13:55:00+11:00</published>
      
        <updated>2019-10-11T13:55:00+11:00</updated>
      

      <id>https://www.beekhof.net/blog/2019/savaged-by-softdog</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2019/savaged-by-softdog">&lt;p&gt;Hardware is imperfect, and software contains bugs. When node level failures occur, the work required from the cluster does not decrease - affected workloads need to be restarted, putting additional stress on surviving peers and making it important to recover the lost capacity.&lt;/p&gt;

&lt;p&gt;Additionally, some workloads may require at-most-one semantics.  Failures affecting these kinds of workloads risk data loss and/or corruption if “lost” nodes remain at least partially functional.  For this reason the system needs to know that the node has reached a safe state before initiating recovery of the workload.&lt;/p&gt;

&lt;p&gt;The process of putting the node into a safe state is called fencing, and the HA community generally prefers power based methods because they provide the best chance of also recovering capacity without human involvement.&lt;/p&gt;

&lt;p&gt;There are two categories of fencing which I will call &lt;em&gt;direct&lt;/em&gt; and &lt;em&gt;indirect&lt;/em&gt; but could equally be called &lt;em&gt;active&lt;/em&gt; and &lt;em&gt;passive&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Direct methods involve action on the part of surviving peers, such as interacting with an IPMI or iLO device, whereas indirect methods rely on the failed node to somehow recognise it is in an unhealthy state &lt;em&gt;and&lt;/em&gt; take steps to enter a safe state on its own.&lt;/p&gt;

&lt;p&gt;The most common form of indirect fencing is the use of a &lt;a href=&quot;https://en.wikipedia.org/wiki/Watchdog_timer&quot;&gt;watchdog&lt;/a&gt;. The watchdog’s timer is reset every N seconds unless quorum is lost or part of the software stack fails.  If the timer (usually some multiple of N) expires, then the watchdog will panic (not shutdown) the machine.&lt;/p&gt;
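&lt;p&gt;The reset-or-expire behaviour can be sketched in a few lines. This is a hypothetical simulation for illustration only, not the Linux watchdog API, and all names are invented:&lt;/p&gt;

```python
import time

# Hypothetical sketch of watchdog timer semantics, not the Linux
# /dev/watchdog interface.  The timer must be reset ("petted") at least
# every N seconds; once some multiple of N passes with no reset, it fires.
class WatchdogTimer:
    def __init__(self, interval, multiple=2):
        self.limit = interval * multiple      # seconds of silence tolerated
        self.last_reset = time.monotonic()

    def pet(self):
        # Called every N seconds, but only while quorum is held and the
        # software stack is healthy.
        self.last_reset = time.monotonic()

    def expired(self):
        # True once the node has gone too long without a reset; a real
        # watchdog would panic (not shut down) the machine at this point.
        return time.monotonic() - self.last_reset > self.limit
```

&lt;p&gt;The useful property is that only a healthy node can keep resetting the timer; everything else leads to a panic after a bounded delay.&lt;/p&gt;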

&lt;p&gt;When done right, watchdogs can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.&lt;/p&gt;

&lt;p&gt;However when relying on indirect fencing mechanisms, it is important to recognise that in the absence of out-of-band communication such as disk based heartbeats, surviving peers have absolutely no ability to validate that the lost node ever reaches a safe state; they are making an &lt;em&gt;assumption&lt;/em&gt; when they start recovery.   There is a risk it didn’t happen as planned, and the cost of getting it wrong is data corruption and/or loss.&lt;/p&gt;

&lt;p&gt;Nothing is without risk though.  Someone with an overdeveloped sense of paranoia and an infinite budget could buy all of Amazon, plus Microsoft and Google for redundancy, to host a static website - and still be undone by an asteroid.  The goal of HA is not to eliminate risk, but reduce it to an acceptable level.  What constitutes an acceptable risk varies person-to-person, project-to-project, and company-to-company, however as a community we encourage people to start by eliminating &lt;a href=&quot;https://en.wikipedia.org/wiki/Single_point_of_failure&quot;&gt;single points of failure (SPoF)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the absence of direct fencing mechanisms, we like hardware based watchdogs because as a self-contained device they can panic the machine without involvement from the host OS.  If the watchdog fails, the node is still healthy and data loss can only occur through failure of additional nodes.  In the event of a power outage, they also lose power, but the node is already safe. A network failure is no longer a SPoF and would require a software bug (incorrect quorum calculations for example) in order to present a problem.&lt;/p&gt;

&lt;p&gt;There is one last class of failures, software bugs, that is the primary concern of HA and kernel experts whenever Softdog is put forward in situations where already purchased cluster machines lack both power management and watchdog hardware.&lt;/p&gt;

&lt;p&gt;Softdog malfunctions originating in software can take two forms - resetting a machine when it should not have (false positive), and not resetting a machine when it should have (false negative). False positives will reduce overall availability due to repeated failovers, but the integrity of the system and its data will remain intact.&lt;/p&gt;

&lt;p&gt;More concerning is the possibility for a single software bug to both cause a node to become unavailable and prevent softdog from recovering the system.   One candidate is a bug in a device or device driver, such as a tight loop or bad spinlock usage, that causes the &lt;a href=&quot;https://bugzilla.redhat.com/buglist.cgi?quicksearch=NMI%20watchdog&quot;&gt;system bus to lock up&lt;/a&gt;.  In such a scenario the watchdog timer would expire, but the softdog would not be able to trigger a reboot.  In this state it is not possible to recover the cluster’s capacity without human intervention, and in theory the entire machine is in a state that prevents it from being able to receive or act on client requests - although &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=1718747&quot;&gt;perhaps not always&lt;/a&gt; (unfortunately the interesting parts of the bug are private).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If the customer needs guaranteed reboot, they should install a hardware watchdog.&lt;/p&gt;

  &lt;p&gt;— Mikulas Patocka (Red Hat kernel engineer)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The greatest danger of softdog, is that most of the time it appears to work just fine.  For months or years it will reboot your machines in response to network and software outages, only to fail you when just the wrong conditions are met.&lt;/p&gt;

&lt;p&gt;Imagine a pointer error, the kind that corrupts the kernel’s internal structures and &lt;a href=&quot;https://bugzilla.redhat.com/buglist.cgi?quicksearch=kernel%20pointer&quot;&gt;causes kernel panics&lt;/a&gt;.  Rarely triggered, but one day you get unlucky and the area of memory that gets scribbled on includes the softdog.&lt;/p&gt;

&lt;p&gt;Just like all the other times it causes the machine to misbehave, but the surviving peers detect it, wait a minute or two, and then begin recovery.  Application services are started, volumes are mounted, database replicas are promoted to master, VIPs are brought up, and requests start being processed.&lt;/p&gt;

&lt;p&gt;However unlike all the other times, the failed peer is still active because the softdog has been corrupted, the application services remain responsive and nothing has removed VIPs or demoted masters.&lt;/p&gt;

&lt;p&gt;At this point, your best case scenario is that database and storage replication is broken.  Requests from some clients will go to the failed node, and some will go to its replacement.  Both will succeed, volumes and databases will be updated independently of what happened on the other peer.  Reads will start to return stale or otherwise inaccurate data, and incorrect decisions will be made based on them.  No transactions will be lost, however the longer the split remains, the further the datasets will drift apart and the more work it will be to reconcile them by hand once the situation is discovered.&lt;/p&gt;

&lt;p&gt;Things get worse if replication doesn’t break.  Now you have the prospect of uncoordinated parallel access to your datasets.  Even if database locking is still somehow working, eventually those changes are persisted to disk and there is nothing to prevent both sides from writing different versions of the same backing file due to non-overlapping database updates.&lt;/p&gt;

&lt;p&gt;Depending on the timing and scope of the updates, you could get:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;only whole file copies from the second writer and lose transactions from the first,&lt;/li&gt;
  &lt;li&gt;whole file copies from a mixture of hosts, leading to a corrupted on-disk representation,&lt;/li&gt;
  &lt;li&gt;files which contain a mixture of bits from both hosts, also leading to a corrupted on-disk representation, or&lt;/li&gt;
  &lt;li&gt;all of the above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ironically an admin’s first instinct, to restart the node or database and see if that fixes the situation, might instead wipe out the only remaining consistent copy of their data (assuming the entire database fits in memory).  At which point all transactions since the previous backup are lost.&lt;/p&gt;

&lt;p&gt;To mitigate this situation, you would either need &lt;em&gt;very&lt;/em&gt; frequent backups, or add a SCSI based fencing mechanism to ensure exclusive access to shared storage, and a network based mechanism to prevent requests from reaching the failed peer.&lt;/p&gt;

&lt;p&gt;Or you could just use a hardware watchdog (even better, try a &lt;a href=&quot;https://www.apc.com/shop/us/en/categories/power-distribution/rack-power-distribution/metered-rack-pdu/N-wj7jiz&quot;&gt;network power switch&lt;/a&gt;).&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="high availability" />
      
        <category term="watchdog" />
      
        <category term="softdog" />
      
        <category term="best practices" />
      

      
        <summary>Hardware is imperfect, and software contains bugs. Don&apos;t use software based watchdogs and expect to survive the latter.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>A New Fencing Mechanism (TBD)</title>
      <link href="https://www.beekhof.net/blog/2018/tbd-fencing" rel="alternate" type="text/html" title="A New Fencing Mechanism (TBD)" />
      <published>2018-03-07T13:11:00+11:00</published>
      
        <updated>2018-03-07T13:11:00+11:00</updated>
      

      <id>https://www.beekhof.net/blog/2018/tbd-fencing</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2018/tbd-fencing">&lt;h2 id=&quot;protecting-database-centric-applications&quot;&gt;Protecting Database Centric Applications&lt;/h2&gt;

&lt;p&gt;In the same way that some applications require the ability to persist records
to disk, for some applications the loss of access to the database means game
over - more so than disconnection from the storage.&lt;/p&gt;

&lt;p&gt;Cinder-volume is one such application and, as it moves towards an active/active
model, it is important that a failure in one peer does not represent a SPoF.
In the Cinder architecture, the API server has no way to know if the cinder-
volume process is fully functional - so it will still receive new requests
to execute.&lt;/p&gt;

&lt;p&gt;A cinder-volume process that has lost access to the storage will naturally be
unable to complete requests.  Worse though is losing access to the database,
as this means the result of an action cannot be recorded.&lt;/p&gt;

&lt;p&gt;For some operations this is ok, if wasteful, because the operation will fail
and be retried. Deletion of something that was already deleted is usually
treated as a success, and a re-attempted creation will simply
return a new volume. However performing the same resize operation twice is
highly problematic since the recorded old size no longer matches the actual
size.&lt;/p&gt;

&lt;p&gt;Even the safe operations may never complete because the bad cinder-volume
process may end up being asked to perform the cleanup operations from its own
failures, which would result in additional failures.&lt;/p&gt;

&lt;p&gt;Additionally, despite not being recommended, some Cinder drivers make use of
locking.  For those drivers it is just as crucial that any locks held by a
faulty or hung peer can be recovered within a finite period of time.  Hence
the need for fencing.&lt;/p&gt;

&lt;p&gt;Since power-based fencing is so dependent on node hardware and there is always
some kind of storage involved, the idea of leveraging the SBD&lt;a href=&quot;#fnote1&quot;&gt;[1]&lt;/a&gt; (
&lt;a href=&quot;/blog/2015/sbd-fun-and-profit&quot;&gt;Storage Based Death&lt;/a&gt; ) project’s capabilities
to do disk based heartbeating and poison-pills is attractive. When combined
with a hardware watchdog, it is an extremely reliable way to ensure safe
access to shared resources.&lt;/p&gt;

&lt;p&gt;However in Cinder’s case, not all vendors can provide raw access to a small
block device on the storage.  Additionally, it is really access to the
database that needs protecting not the storage.  So while useful, it is still
relatively easy to construct scenarios that would defeat SBD.&lt;/p&gt;

&lt;h2 id=&quot;a-new-type-of-death&quot;&gt;A New Type of Death&lt;/h2&gt;

&lt;p&gt;Where SBD uses storage APIs to protect applications persisting data to disk,
we could also have one based on SQL calls that did the same for Cinder-volume
and other database centric applications.&lt;/p&gt;

&lt;p&gt;I therefore propose TBD - “Table Based Death” (or “To Be Decided” depending on
how you’re wired).&lt;/p&gt;

&lt;p&gt;Instead of heartbeating to a designated slot on a block device, the slots
become rows in a small table in the database that this new daemon would
interact with via SQL.&lt;/p&gt;

&lt;p&gt;When a peer is connected to the database, a cluster manager like Pacemaker can
use a poison pill to fence the peer in the event of a network, node, or
resource level failure.  Should the peer ever lose quorum or its connection
to the database, surviving peers can assume with a degree of confidence that
it will self terminate via the watchdog after a known interval.&lt;/p&gt;

&lt;p&gt;The desired behaviour can be derived from the following properties:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Quorum is required to write poison pills into a peer’s slot&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A peer that finds a poison pill in its slot triggers its watchdog and reboots&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A peer that loses connection to the database won’t be able to write status
information to its slot, which will trigger the watchdog&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A peer that loses connection to the database won’t be able to write a
poison pill into another peer’s slot&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the underlying database loses too many peers and reverts to read-only,
we won’t be able to write to our slot, which triggers the watchdog&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When a peer loses connection to its peers, the survivors maintain
quorum (1) and write a poison pill to the lost node’s slot (1), ensuring
the peer will terminate due to scenario (2) or (3)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; seconds is the worst case time a peer would need to either notice a
poison pill or disconnection from the database, and trigger the watchdog,
then we can arrange for services to be recovered after some multiple of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;
has elapsed, in the same way that Pacemaker does for SBD.&lt;/p&gt;

&lt;p&gt;While TBD would be a valuable addition to a traditional cluster architecture,
it is also conceivable that it could be useful in a stand-alone configuration.
Consideration should therefore be given during the design phase as to how best
to consume membership, quorum, and fencing requests from multiple sources - not
just a particular application or cluster manager.&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Just as in the SBD architecture, we need TBD to be configured to use the same
persistent store (database) as is being consumed by the applications it is
protecting.  This is crucial as it means the same criteria that enables the
application to function, also results in the node self-terminating if it
cannot be satisfied.&lt;/p&gt;

&lt;p&gt;However for security reasons, the table would ideally live in a different
namespace and with different access permissions.&lt;/p&gt;

&lt;p&gt;It is also important to note that significant design challenges would need to
be faced in order to protect applications managed by the same cluster that
provides the highly available database being consumed by TBD.  Consideration
would particularly need to be given to the behaviour of TBD and the
applications it was protecting during shutdown and cold-start scenarios.  Care
would need to be taken to avoid unnecessary self-fencing operations and to
ensure that failure responses are not impacted when handling these
scenarios.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;p&gt;&lt;a name=&quot;fnote1&quot;&gt;[1]&lt;/a&gt; SBD lives under the ClusterLabs banner but can
operate without a traditional corosync/pacemaker stack.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="cluster" />
      
        <category term="fencing" />
      
        <category term="concepts" />
      
        <category term="cinder" />
      
        <category term="openstack" />
      

      
        <summary>Protecting database centric applications in the absence of power fencing</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>A New Thing</title>
      <link href="https://www.beekhof.net/blog/2018/replication-operator" rel="alternate" type="text/html" title="A New Thing" />
      <published>2018-02-16T14:32:00+11:00</published>
      
        <updated>2018-02-16T14:32:00+11:00</updated>
      

      <id>https://www.beekhof.net/blog/2018/replication-operator</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2018/replication-operator">&lt;p&gt;I made a &lt;a href=&quot;https://github.com/beekhof/rss-operator&quot;&gt;new thing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re interested in &lt;a href=&quot;https://kubernetes.io&quot;&gt;Kubernetes&lt;/a&gt; and/or managing
replicated applications, such as Galera, then you might also be interested in
an &lt;a href=&quot;https://coreos.com/blog/introducing-operators.html&quot;&gt;operator&lt;/a&gt; that allows
this class of applications to be managed natively by Kubernetes.&lt;/p&gt;

&lt;p&gt;There is plenty to read on &lt;a href=&quot;https://github.com/beekhof/rss-operator/blob/master/doc/Rationale.md&quot;&gt;why&lt;/a&gt;
the operator exists, &lt;a href=&quot;https://github.com/beekhof/rss-operator/blob/master/doc/design/replication.md&quot;&gt;how&lt;/a&gt;
replication is managed and the steps to &lt;a href=&quot;https://github.com/beekhof/rss-operator/blob/master/doc/user/install_guide.md&quot;&gt;install it&lt;/a&gt;
if you’re interested in trying it out.&lt;/p&gt;

&lt;p&gt;There is also a screencast that demonstrates the major concepts:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://asciinema.org/a/164903&quot;&gt;&lt;img src=&quot;https://asciinema.org/a/164903.png&quot; alt=&quot;asciicast&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback welcome.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="high availability" />
      
        <category term="kubernetes" />
      
        <category term="galera" />
      

      
        <summary>I made something new, maybe you&apos;ll find it useful</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Two Nodes - The Devil is in the Details</title>
      <link href="https://www.beekhof.net/blog/2018/two-node-problems" rel="alternate" type="text/html" title="Two Nodes - The Devil is in the Details" />
      <published>2018-02-16T10:52:00+11:00</published>
      
        <updated>2018-02-16T10:52:00+11:00</updated>
      

      <id>https://www.beekhof.net/blog/2018/two-node-problems</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2018/two-node-problems">&lt;p&gt;&lt;em&gt;tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33%
cheaper, but while it’s possible to construct good ones, most will have subtle
failure modes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first step towards creating any HA system is to look for and try to
eliminate single points of failure, often abbreviated as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SPoF&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It is impossible to eliminate &lt;em&gt;all&lt;/em&gt; risk of downtime, especially when one
considers the additional complexity that comes with introducing additional
redundancy. Concentrating on single (rather than chains of related, and therefore
decreasingly probable) points of failure is widely accepted as a suitable
compromise.&lt;/p&gt;

&lt;p&gt;The natural starting point, then, is to have more than one node.  However,
before the system can move services to the surviving node after a failure, in
general, it needs to be sure that they are not still active elsewhere.&lt;/p&gt;

&lt;p&gt;So not only are we looking for SPoFs, but we are also looking to balance risks
and consequences, and the calculus will be different for every deployment &lt;a href=&quot;#fnote1&quot;&gt;[1]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is no downside if a failure causes both members of a two-node cluster to
serve up the same static website. However, it’s a very different story if it
results in &lt;strong&gt;both sides independently managing a shared job queue or
providing uncoordinated write access to a replicated database or shared
filesystem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So in order to prevent a single node failure from corrupting your data or
blocking recovery, we rely on something called fencing.&lt;/p&gt;

&lt;h2 id=&quot;fencing&quot;&gt;Fencing&lt;/h2&gt;

&lt;p&gt;At its heart, fencing turns the question &lt;em&gt;Can our peer cause data
corruption?&lt;/em&gt; into the answer &lt;em&gt;no&lt;/em&gt; by isolating it both from incoming
requests and persistent storage. The most common approach to fencing is to
power off failed nodes.&lt;/p&gt;

&lt;p&gt;There are two categories of fencing which I will call &lt;em&gt;direct&lt;/em&gt; and &lt;em&gt;indirect&lt;/em&gt;
but could equally be called &lt;em&gt;active&lt;/em&gt; and &lt;em&gt;passive&lt;/em&gt;. Direct methods involve
action on the part of surviving peers, such as interacting with an IPMI or iLO
device, whereas indirect methods rely on the failed node to somehow recognise it is
in an unhealthy state (or is at least preventing remaining members from
recovering) and signal a &lt;a href=&quot;https://en.wikipedia.org/wiki/Watchdog_timer&quot;&gt;hardware
watchdog&lt;/a&gt; to panic the machine.&lt;/p&gt;

&lt;p&gt;Quorum helps in both these scenarios.&lt;/p&gt;

&lt;h3 id=&quot;direct-fencing&quot;&gt;Direct Fencing&lt;/h3&gt;

&lt;p&gt;In the case of direct fencing, quorum can be used to prevent fencing races
when the network fails.  By including the concept of quorum, there is enough
information in the system (even without connectivity to their peers) for nodes
to automatically know whether they should initiate fencing and/or recovery.&lt;/p&gt;

&lt;p&gt;Without quorum, both sides of a network split will rightly assume the other is
dead and rush to fence the other. In the worst case, both sides succeed,
leaving the entire cluster offline.  The next worst case is a &lt;em&gt;death match&lt;/em&gt;: a
never-ending cycle of nodes coming up, not seeing their peers, rebooting them
and initiating recovery only to be rebooted when their peer goes through the
same logic.&lt;/p&gt;

&lt;p&gt;The problem with fencing is that the most commonly used devices become
inaccessible due to the same failure events we want to use them to recover
from. Most IPMI and iLO cards both lose power along with the hosts they control
and, by default, use the same network that is causing the peers to believe the
others are offline.&lt;/p&gt;

&lt;p&gt;Sadly, the intricacies of IPMI and iLO devices are rarely a consideration at the
point hardware is being purchased.&lt;/p&gt;

&lt;h3 id=&quot;indirect-fencing&quot;&gt;Indirect Fencing&lt;/h3&gt;

&lt;p&gt;Quorum is also crucial for driving indirect fencing and, when done right, can
allow survivors to safely assume that missing nodes have entered a safe state
after a defined period of time.&lt;/p&gt;

&lt;p&gt;In such a setup, the watchdog’s timer is reset every N seconds unless quorum
is lost. If the timer (usually some multiple of N) expires, then the machine
performs an ungraceful power off (not shutdown).&lt;/p&gt;
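&lt;p&gt;On Linux clusters this pattern is most commonly implemented with SBD driving the hardware watchdog. A minimal sketch of the relevant configuration (the device path and timeout are illustrative values):&lt;/p&gt;

```shell
# /etc/sysconfig/sbd (illustrative values)
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5    # seconds of silence before the watchdog fires
```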

&lt;p&gt;This is very effective, but without quorum to drive it, there is insufficient
information within the cluster to determine the difference between a
network outage and the failure of your peer. The reason this matters is that
without a way to differentiate between the two cases, you are forced to choose
a single behaviour mode for both.&lt;/p&gt;

&lt;p&gt;The problem with choosing a single response is that there is no course of
action that both maximises availability and prevents corruption.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;If you choose to assume the peer is alive but it actually failed, then the
cluster has unnecessarily stopped services.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you choose to assume the peer is dead but it was just a network outage,
then the best case scenario is that you have signed up for some manual
reconciliation of the resulting datasets.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter what heuristics you use, it is trivial to construct a single failure
that either leaves both sides running or where the cluster unnecessarily shuts
down the surviving peer(s). Taking quorum away really does deprive the cluster
of one of the most powerful tools in its arsenal.&lt;/p&gt;

&lt;p&gt;Given no other alternative, the best approach is normally to sacrifice
availability. Making corrupted data highly available does no-one any good, and
manually reconciling divergent datasets is no fun either.&lt;/p&gt;

&lt;h2 id=&quot;quorum&quot;&gt;Quorum&lt;/h2&gt;

&lt;p&gt;Quorum sounds great, right?&lt;/p&gt;

&lt;p&gt;The only drawback is that in order to have it in a cluster with N members, a
partition must contain a majority of them: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floor(N/2) + 1&lt;/code&gt; nodes, counting
yourself. Which is impossible in a two-node cluster after one node has failed.&lt;/p&gt;
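&lt;p&gt;The arithmetic is simple enough to sketch (illustrative only; in a real cluster the counting is done by the membership layer, e.g. Corosync’s votequorum):&lt;/p&gt;

```shell
# has_quorum TOTAL VISIBLE: succeeds when the VISIBLE partition
# (counting ourselves) is a strict majority of the TOTAL membership
has_quorum() {
  [ "$2" -ge $(( $1 / 2 + 1 )) ]
}

if has_quorum 3 2; then echo "3 nodes, 2 visible: quorate"; fi
if ! has_quorum 2 1; then echo "2 nodes, 1 visible: never quorate"; fi
```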

&lt;p&gt;Which finally brings us to the fundamental issue with two-nodes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;quorum does not make sense in two node clusters, and&lt;/p&gt;

  &lt;p&gt;without it there is no way to reliably determine a course of action that
both maximises availability and prevents corruption&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even in a system of two nodes connected by a crossover cable, there is no way
to conclusively differentiate between a network outage and a failure of the
other node. Unplugging one end (whose likelihood is surely proportional to the
distance between the nodes) would be enough to invalidate any assumption that
link health equals peer node health.&lt;/p&gt;

&lt;h2 id=&quot;making-two-nodes-work&quot;&gt;Making Two Nodes Work&lt;/h2&gt;

&lt;p&gt;Sometimes the client can’t or won’t make the additional purchase of a third
node, and we need to look for alternatives.&lt;/p&gt;

&lt;h3 id=&quot;option-1---add-a-backup-fencing-method&quot;&gt;Option 1 - Add a Backup Fencing Method&lt;/h3&gt;

&lt;p&gt;A node’s iLO or IPMI device represents a SPoF because, by definition, if it
fails the survivors cannot use it to put the node into a safe state. In a
cluster of 3 nodes or more, we can mitigate this with a quorum calculation and a
hardware watchdog (an indirect fencing mechanism as previously discussed).  In
a two node case we must instead use &lt;em&gt;network power switches&lt;/em&gt; (aka. &lt;em&gt;power
distribution units&lt;/em&gt; or PDUs).&lt;/p&gt;

&lt;p&gt;After a failure, the survivor first attempts to contact the primary (the
built-in iLO or IPMI) fencing device.  If that succeeds, recovery proceeds as
normal.  Only if the iLO/IPMI device fails is the PDU invoked and assuming it
succeeds, recovery can again continue.&lt;/p&gt;

&lt;p&gt;Be sure to place the PDU on a &lt;em&gt;different network to the cluster traffic&lt;/em&gt;,
otherwise a single network failure will prevent access to both fencing devices
and block service recovery.&lt;/p&gt;

&lt;p&gt;You might be wondering at this point… &lt;em&gt;doesn’t the PDU represent a single
point of failure?&lt;/em&gt; To which the answer is “definitely”.&lt;/p&gt;

&lt;p&gt;If that risk concerns you, and you would not be alone, connect both peers to
two PDUs and tell your cluster software to use both when powering peers on and
off. Now the cluster remains active if one PDU dies, and would require a
second fencing failure of either the other PDU or an IPMI device in order to
block recovery.&lt;/p&gt;
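&lt;p&gt;With Pacemaker, this primary-then-backup ordering is expressed as fencing topology levels. A sketch using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcs&lt;/code&gt;, assuming the IPMI and PDU fencing devices have already been created with the (illustrative) names shown:&lt;/p&gt;

```shell
# Level 1: try the node's built-in IPMI device first
pcs stonith level add 1 node1 fence_ipmi_node1
# Level 2: only if that fails, cut power at both PDUs
pcs stonith level add 2 node1 fence_pdu_a,fence_pdu_b
```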

&lt;h3 id=&quot;option-2---add-an-arbitrator&quot;&gt;Option 2 - Add an Arbitrator&lt;/h3&gt;

&lt;p&gt;In some scenarios, although a backup fencing method would be technically
possible, it is politically challenging. Many companies like to have a degree
of separation between the admin and application folks, and security conscious
network admins are not always enthusiastic about handing over the usernames
and passwords to the PDUs.&lt;/p&gt;

&lt;p&gt;In this case, the recommended alternative is to create a neutral third-party
that can supplement the quorum calculation.&lt;/p&gt;

&lt;p&gt;In the event of a failure, a node needs to be able to see either its peer or
the arbitrator in order to recover services. The arbitrator can also act as a
tie-breaker if both nodes can see the arbitrator but not each other.&lt;/p&gt;

&lt;p&gt;This option needs to be paired with an indirect fencing method, such as a
watchdog that is configured to panic the machine if it loses connection to
both its peer and the arbitrator. In this way, the survivor is able to assume with
reasonable confidence that its peer will be in a safe state after the watchdog
expiry interval.&lt;/p&gt;

&lt;p&gt;The practical difference between an arbitrator and a third node is that the
arbitrator has a much lower footprint and can act as a tie-breaker for more
than one cluster.&lt;/p&gt;
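&lt;p&gt;Corosync implements exactly this arbitrator pattern with its quorum device. A sketch of the relevant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;corosync.conf&lt;/code&gt; fragment, assuming &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;corosync-qnetd&lt;/code&gt; is running on a third machine (the hostname is illustrative):&lt;/p&gt;

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            host: arbitrator.example.com  # the neutral third party
            algorithm: ffsplit            # grant the extra vote to exactly one side of a split
        }
    }
}
```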

&lt;h3 id=&quot;option-3---more-human-than-human&quot;&gt;Option 3 - More Human Than Human&lt;/h3&gt;

&lt;p&gt;The final approach is for survivors to continue hosting whatever services they
were already running, but &lt;em&gt;not start any new ones&lt;/em&gt; until either the problem
resolves itself (network heals, node reboots) or a human takes on the
responsibility of manually confirming that the other side is dead.&lt;/p&gt;

&lt;h3 id=&quot;bonus-option&quot;&gt;Bonus Option&lt;/h3&gt;

&lt;p&gt;Did I already mention you could add a third node?
We test those a lot :-)&lt;/p&gt;

&lt;h2 id=&quot;two-racks&quot;&gt;Two Racks&lt;/h2&gt;

&lt;p&gt;For the sake of argument, let’s imagine I’ve convinced you, the reader, of the
merits of a third node. We must now consider the physical arrangement of the
nodes. If they are placed in (and obtain power from) the same rack, that too
represents a SPoF, and one that cannot be resolved by adding a second rack.&lt;/p&gt;

&lt;p&gt;If this is surprising, consider what happens when the rack with two nodes
fails and how the surviving node would differentiate between this case and a
network failure.&lt;/p&gt;

&lt;p&gt;The short answer is that it can’t and we’re back to having all the problems of
the two-node case. Either the survivor:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;ignores quorum and incorrectly tries to initiate recovery during network
outages (whether fencing is able to complete is a different story, depending
on whether PDUs are involved and whether they share power with any of the racks), or&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;respects quorum and unnecessarily shuts itself down when its peer fails&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either way, two racks is no better than one, and the nodes must either be given
independent supplies of power or be distributed across three (or more,
depending on how many nodes you have) racks.&lt;/p&gt;

&lt;h2 id=&quot;two-datacenters&quot;&gt;Two Datacenters&lt;/h2&gt;

&lt;p&gt;By this point the more risk averse readers might be thinking about disaster
recovery. What happens when an asteroid hits the one datacenter with our three
nodes distributed across three different racks? Obviously &lt;em&gt;Bad Things(tm)&lt;/em&gt; but
depending on your needs, adding a second datacenter might not be enough.&lt;/p&gt;

&lt;p&gt;Done properly, a second datacenter gives you a (reasonably) up-to-date and
consistent copy of your services and their data.  However, just like the
two-node and two-rack scenarios, there is not enough information in the system to
both maximise availability and prevent corruption (or diverging datasets).
Even with three nodes (or racks), distributing them across only two
datacenters leaves the system unable to reliably make the correct decision in
the (now far more likely) event that the two sides cannot communicate.&lt;/p&gt;

&lt;p&gt;Which is not to say that a two-datacenter solution is never appropriate. It
is not uncommon for companies to &lt;em&gt;want&lt;/em&gt; a human in the loop before taking the
extraordinary step of failing over to a backup datacenter.  Just be aware that
if you want automated failover, you’re either going to need a third datacenter
in order for quorum to make sense (either directly or via an arbitrator) or
find a way to reliably power fence an entire datacenter.&lt;/p&gt;

&lt;h6 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h6&gt;

&lt;p&gt;&lt;a name=&quot;fnote1&quot;&gt;&lt;/a&gt;[1] Not everyone needs redundant power companies with
independent transmission lines.  Although the paranoia paid off for at least
one customer when their monitoring detected a failing transformer.  The customer
was on the phone trying to warn the power company when it finally blew.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="high availability" />
      

      
        <summary>Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but most will have subtle failure modes</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Containerizing Databases with Kubernetes and Stateful Sets</title>
      <link href="https://www.beekhof.net/blog/2017/databases-on-kubernetes" rel="alternate" type="text/html" title="Containerizing Databases with Kubernetes and Stateful Sets" />
      <published>2017-02-15T12:45:00+11:00</published>
      
        <updated>2017-02-15T12:45:00+11:00</updated>
      

      <id>https://www.beekhof.net/blog/2017/databases-on-kubernetes</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2017/databases-on-kubernetes">&lt;p&gt;The &lt;a href=&quot;https://kubernetes.io/docs/tutorials/stateful-application/run-replicated-stateful-application/&quot;&gt;canonical example&lt;/a&gt;
for Stateful Sets with a replicated application in Kubernetes is a
database.&lt;/p&gt;

&lt;p&gt;As someone looking at how to move foundational OpenStack services to
containers, and eventually to Kubernetes, this is great news, as
databases are very typical of applications with complex
bootstrap and recovery processes.&lt;/p&gt;

&lt;p&gt;If we can successfully show Kubernetes managing a multi-master
database natively and safely, the patterns would be broadly applicable and
there would be one less reason to have a traditional cluster manager in such
contexts.&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;Kubernetes today is arguably unsuitable for deploying databases UNLESS
the pod owner has the ability to verify the physical status of the
underlying hardware and is prepared to perform manual recovery in some
scenarios.&lt;/p&gt;

&lt;h1 id=&quot;general-comments&quot;&gt;General Comments&lt;/h1&gt;

&lt;p&gt;The example allows for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; slaves but limits itself to a single master.&lt;/p&gt;

&lt;p&gt;Which is absolutely a valid deployment, but it does prevent us from
exploring some of the more interesting multi-master corner cases, and
unfortunately, from an HA perspective, it makes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; a single point of
failure because:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;although MySQL slaves can be easily promoted to masters, the
containers do not expose such a mechanism, and even if they did&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;writers are told to connect to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; explicitly rather than use the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql-read&lt;/code&gt; service&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the worker on which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; is running hangs or becomes
unreachable, you’re out of luck.&lt;/p&gt;

&lt;p&gt;The loss of this worker currently puts Kubernetes in a no-win
situation.  Either it does the safe thing (the current behaviour) and
prevents the pod from being recovered or the attached volume from
being accessed, leading to more downtime (because it requires an admin
to intervene) than a traditional HA solution.  Or it allows the pod to
be recovered, risking data corruption if the worker (and by inference,
the pod) is not completely dead.&lt;/p&gt;

&lt;h2 id=&quot;ordered-bootstrap-and-recovery&quot;&gt;Ordered Bootstrap and Recovery&lt;/h2&gt;

&lt;p&gt;One of the more important capabilities of StatefulSets is that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pod N&lt;/code&gt; cannot be recovered, created or destroyed until all pods
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N-1&lt;/code&gt; are active and healthy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This allows container authors to make many simplifying assumptions
during bootstrapping and scaling events (such as who has the most
recent copy of the data at a given point).&lt;/p&gt;

&lt;p&gt;Unfortunately, until we get &lt;a href=&quot;https://github.com/kubernetes/kubernetes/pull/34160/files&quot;&gt;pod safety and termination
guarantees&lt;/a&gt;,
it means that if a worker node crashes or becomes unreachable, its
pods are unrecoverable and any auto-scaling policies cannot be
enacted.&lt;/p&gt;

&lt;p&gt;Additionally, the enforcement of this policy only happens at
scheduling time.&lt;/p&gt;

&lt;p&gt;This means that if there is a delay enacting the scheduler’s results,
an image must be downloaded, or an init container is part of the scale-up
process, there is a significant period of time in which an existing
pod may die before new replicas can be constructed.&lt;/p&gt;

&lt;p&gt;As I type this, the current status on my testbed demonstrates this
fragility:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# kubectl get pods
NAME                         READY     STATUS        RESTARTS   AGE
[...]
hostnames-3799501552-wjd65   0/1       Pending       0          4m
mysql-0                      2/2       Running       4          4d
mysql-2                      0/2       Init:0/2      0          19h
web-0                        0/1       Unknown       0          19h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
&lt;/blockquote&gt;

&lt;p&gt;As described, the feature suggests this state (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql-2&lt;/code&gt; in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Init&lt;/code&gt; state while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql-1&lt;/code&gt; is not active) can never happen.&lt;/p&gt;

&lt;p&gt;While such behaviour remains possible, container authors must take
care to include logic to detect and handle such scenarios.  The
easiest course of action is to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exit&lt;/code&gt; and cause the container to
be re-scheduled.&lt;/p&gt;

&lt;p&gt;The example partially addresses this race condition by bootstrapping
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod N&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N-1&lt;/code&gt;.  This limits the impact of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod N&lt;/code&gt;’s failure to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod
N+1&lt;/code&gt;’s startup/recovery period.&lt;/p&gt;

&lt;p&gt;It is easy to conceive of an extended solution that closed the window
completely by trying pods &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N-1&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; in order until it found an active
peer to sync from.&lt;/p&gt;

&lt;h2 id=&quot;extending-the-pattern-to-galera&quot;&gt;Extending the Pattern to Galera&lt;/h2&gt;

&lt;p&gt;All Galera peers are writable, which makes some aspects easier and
others more complicated.&lt;/p&gt;

&lt;p&gt;Bootstrapping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; would require some logic to determine if it is
bootstrapping the cluster (an empty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--wsrep-cluster-address=gcomm://&lt;/code&gt;) or in recovery mode
(a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcomm://&lt;/code&gt; address listing all the peers), but special handling of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; has precedent and is not onerous.  The remaining pods would
unconditionally use the full peer list.&lt;/p&gt;
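&lt;p&gt;A sketch of what that decision might look like in an entrypoint script. The pod and service names are hypothetical, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;any_peer_alive&lt;/code&gt; stands in for a real reachability probe:&lt;/p&gt;

```shell
# Hypothetical entrypoint fragment: only pod 0, with no live peers,
# bootstraps a brand-new cluster; every other pod joins the existing one.
PEERS="gcomm://mysql-0.mysql,mysql-1.mysql,mysql-2.mysql"

cluster_address() {
  ordinal="${HOSTNAME##*-}"      # StatefulSet pod names end in -N
  if [ "$ordinal" = "0" ]; then
    if ! any_peer_alive; then
      echo "gcomm://"            # empty list: bootstrap a new cluster
      return
    fi
  fi
  echo "$PEERS"                  # join (or recover into) the existing cluster
}
```

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysqld&lt;/code&gt; would then be launched with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--wsrep-cluster-address=&quot;$(cluster_address)&quot;&lt;/code&gt;.&lt;/p&gt;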

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; is no longer a single point of failure with respect to writes,
however the loss of the worker it is hosted on will continue to
inhibit scaling events until manually confirmed and cleaned up by an
admin.&lt;/p&gt;

&lt;p&gt;Violations of the linear start/stop ordering could be significant if
they result from a failure of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; and occur while bootstrapping
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 1&lt;/code&gt;. Further, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 1&lt;/code&gt; was stopped significantly earlier than
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt;, then depending on the implementation details of Galera, it is
conceivable that a failure of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 1&lt;/code&gt; is synchronising
might result in either data loss or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 1&lt;/code&gt; becoming out of sync.&lt;/p&gt;

&lt;h2 id=&quot;removing-shared-storage&quot;&gt;Removing Shared Storage&lt;/h2&gt;

&lt;p&gt;One of the main reasons to choose a replicated database is that it
doesn’t require shared storage.&lt;/p&gt;

&lt;p&gt;Having multiple slaves certainly assists read scalability, and if we
modified the example to use multiple masters it would likely improve
write performance and failover times.  However having multiple copies
of the database on the same shared storage does not provide additional
redundancy over what the storage already provides - and that is
important to some customers.&lt;/p&gt;

&lt;p&gt;While there are ways to give containers access to local storage,
attempting to make use of them for a replicated database is problematic:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It is currently not possible to enforce that pods in a Stateful Set
always run on the same node.&lt;/p&gt;

    &lt;p&gt;Kubernetes does have the ability to assign &lt;a href=&quot;https://kubernetes.io/docs/user-guide/node-selection/&quot;&gt;node
affinity&lt;/a&gt; for
pods, however since the Stateful Sets are a template, there is no
opportunity to specify a different &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubernetes.io/hostname&lt;/code&gt; selector
for each copy.&lt;/p&gt;

    &lt;p&gt;As the example is written, this is particularly important for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt;
as it is the only writer and the only one guaranteed to have the
most up-to-date version of the data.&lt;/p&gt;

    &lt;p&gt;It might be possible to work around this problem if the replica count
exceeded the worker count and all peers were writable masters;
however, incorporating such logic into the pod would negate much of
the benefit of using Stateful Sets.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A worker going offline prevents the pod from being started.&lt;/p&gt;

    &lt;p&gt;In the shared storage case, it was possible to manually verify the
host was down, delete the pod and have Kubernetes restart it.&lt;/p&gt;

    &lt;p&gt;Without shared storage this is no longer possible for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; as that
worker is the only one with the data used to bootstrap the slaves.&lt;/p&gt;

    &lt;p&gt;The only options are to bring the worker back, or manually alter the
node affinities to have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pod 0&lt;/code&gt; replace the slave on the worker with
the most up-to-date one.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
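&lt;p&gt;For reference, the kind of per-pod pinning that a Stateful Set template cannot vary per replica looks like this in a standalone pod spec (the hostname is illustrative):&lt;/p&gt;

```yaml
# A plain pod can pin itself to one worker; a Stateful Set template
# applies the same selector to every replica it creates.
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-1
```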

&lt;h2 id=&quot;summing-up&quot;&gt;Summing Up&lt;/h2&gt;

&lt;p&gt;While Stateful Sets may not satisfy those looking for data redundancy,
they are a welcome addition to Kubernetes that will require pod safety
and termination guarantees before they can really shine.  The example
gives us a glimpse of the future but arguably shouldn’t be used in
production yet.&lt;/p&gt;

&lt;p&gt;Those looking to manage a database with Kubernetes today would be
advised to use individual pods and/or vanilla ReplicaSets; they need the
ability to verify the physical status of the underlying hardware and
should be prepared to perform manual recovery in some scenarios.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="high availability" />
      
        <category term="database" />
      
        <category term="openstack" />
      
        <category term="kubernetes" />
      

      
        <summary>An examination of the canonical StatefulSet example for managing databases with Kubernetes from a rigorous HA perspective.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>HA for Composable Deployments of OpenStack</title>
      <link href="https://www.beekhof.net/blog/2016/composable-openstack-ha" rel="alternate" type="text/html" title="HA for Composable Deployments of OpenStack" />
      <published>2016-07-24T13:20:00+10:00</published>
      
        <updated>2016-07-24T13:20:00+10:00</updated>
      

      <id>https://www.beekhof.net/blog/2016/composable-openstack-ha</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2016/composable-openstack-ha">&lt;p&gt;One of the hot topics for OpenStack deployments is &lt;em&gt;composable roles&lt;/em&gt; -
the ability to mix-and-match which services live on which nodes.&lt;/p&gt;

&lt;p&gt;This is mostly a solved problem for services not managed by the
cluster, but what of the services still managed by the cluster?&lt;/p&gt;

&lt;h1 id=&quot;considerations&quot;&gt;Considerations&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Scale up&lt;/p&gt;

    &lt;p&gt;Naturally we want to be able to add more capacity easily&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Scale down&lt;/p&gt;

    &lt;p&gt;And have the option to take it away again if it is no longer necessary&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Role re-assignment post-deployment&lt;/p&gt;

    &lt;p&gt;Ideally the task of taking capacity from one service and giving it
to another would be a core capability and not require a node be
nuked from orbit first.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Flexible role assignments&lt;/p&gt;

    &lt;p&gt;Ideally, the architecture would not impose limitations on how roles
are assigned.&lt;/p&gt;

    &lt;p&gt;By allowing roles to be assigned on an ad-hoc basis, we can allow
arrangements that avoid single-points-of-failure (SPoF) and
potentially take better advantage of the available hardware.  For
example:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;node 1: galera and rabbit&lt;/li&gt;
      &lt;li&gt;node 2: galera and mongodb&lt;/li&gt;
      &lt;li&gt;node 3: rabbit and mongodb&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This also has implications when just one of the roles needs to be
scaled up (or down).  If roles become inextricably linked at
install time, this requires every service in the group to scale
identically - potentially resulting in higher hardware costs when
there are services that cannot do so and must be separated.&lt;/p&gt;

    &lt;p&gt;Instead, even if two services (let’s say &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;galera&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rabbit&lt;/code&gt;) are
originally assigned to the same set of nodes, this should imply
nothing about how either of them can or should be scaled in the
future.&lt;/p&gt;

    &lt;p&gt;We want the ability to deploy a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rabbit&lt;/code&gt; server without
requiring it host &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;galera&lt;/code&gt; too.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;scope&quot;&gt;Scope&lt;/h1&gt;

&lt;p&gt;This need only apply to non-OpenStack services, however it could be
extended to those as well if you were unconvinced by my other &lt;a href=&quot;/blog/2016/next-openstack-ha-arch&quot;&gt;recent
proposal&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Red Hat, the list of services affected would be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;HAProxy&lt;/li&gt;
  &lt;li&gt;Any VIPs&lt;/li&gt;
  &lt;li&gt;Galera&lt;/li&gt;
  &lt;li&gt;Redis&lt;/li&gt;
  &lt;li&gt;Mongo DB&lt;/li&gt;
  &lt;li&gt;Rabbit MQ&lt;/li&gt;
  &lt;li&gt;Memcached&lt;/li&gt;
  &lt;li&gt;openstack-cinder-volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, if the deployment has been configured to provide &lt;a href=&quot;https://access.redhat.com/articles/1544823&quot;&gt;Highly
Available Instances&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;nova-compute-wait&lt;/li&gt;
  &lt;li&gt;nova-compute&lt;/li&gt;
  &lt;li&gt;nova-evacuate&lt;/li&gt;
  &lt;li&gt;fence-nova&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;proposed-solution&quot;&gt;Proposed Solution&lt;/h1&gt;

&lt;p&gt;In essence, I propose that there be a single native cluster,
consisting of between 3 (the minimum sane cluster size) and 16
(roughly Corosync’s out-of-the-box limit) nodes, augmented by a
collection of zero-or-more remote nodes.&lt;/p&gt;

&lt;p&gt;Both native and remote nodes will have roles assigned to them,
allowing Pacemaker to automagically move resources to the right
location based on the roles.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note that all nodes, both native and remote, can have zero-or-more
roles and it is also possible to have a mixture of native and
remote nodes assigned to the same role.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will allow us, by changing a few flags (and potentially adding
extra remote nodes to the cluster), to go from a fully collapsed
deployment to a fully segregated one - and not only at install time.&lt;/p&gt;

&lt;p&gt;If installers wish to support it&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, this architecture can cope with
roles being split out (or recombined) after deployment, and of course
the cluster won’t need to be taken down; resources will move as
appropriate.&lt;/p&gt;

&lt;p&gt;Although there is no hard requirement that anything except the fencing
devices run on the native nodes, best practice would arguably dictate
that HAProxy and the VIPs be located there unless an external load
balancer is in use.&lt;/p&gt;

&lt;p&gt;The purpose of this would be to limit the impact of a hypothetical
pacemaker-remote bug.  Should such a bug exist, by virtue of being the
gateway to all the other APIs, HAProxy and the VIPs are the elements
one would least want to be affected.&lt;/p&gt;

&lt;p&gt;Some installers may even choose to enforce this in the configuration,
but “by convention” is probably sufficient.&lt;/p&gt;
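
&lt;p&gt;One sketch of how an installer might enforce it is a constraint on
Pacemaker’s automatically maintained &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#kind&lt;/code&gt; node attribute, which
distinguishes full cluster nodes from remote ones:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs constraint location haproxy-clone rule score=0 &quot;#kind&quot; ne remote
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;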

&lt;h1 id=&quot;implementation-details&quot;&gt;Implementation Details&lt;/h1&gt;

&lt;p&gt;The key to this implementation is Pacemaker’s concept of &lt;a href=&quot;http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-node-attributes.html&quot;&gt;node
attributes&lt;/a&gt;
and
&lt;a href=&quot;http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_node_attribute_expressions.html&quot;&gt;expressions&lt;/a&gt;
that make use of them.&lt;/p&gt;

&lt;p&gt;Instance attributes can be created with commands of the form:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs property set --node controller-0 proxy-role=true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
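
&lt;p&gt;A node holding several roles simply receives several attributes; for
example (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;galera-role&lt;/code&gt; name here is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs property set --node controller-0 proxy-role=true
pcs property set --node controller-0 galera-role=true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;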

&lt;blockquote&gt;
  &lt;p&gt;Note that this differs from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;osprole=compute/controller&lt;/code&gt; scheme
used in the &lt;a href=&quot;https://access.redhat.com/articles/1544823&quot;&gt;Highly Available
Instances&lt;/a&gt; instructions.
That arrangement wouldn’t work here as each node may have several
roles assigned to it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Under the covers, the result in Pacemaker’s configuration would look something like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;cib ...&amp;gt;
  &amp;lt;configuration&amp;gt;
    &amp;lt;nodes&amp;gt;
      &amp;lt;node id=&quot;1&quot; uname=&quot;controller-0&quot;&amp;gt;
        &amp;lt;instance_attributes id=&quot;controller-0-attributes&quot;&amp;gt;
          &amp;lt;nvpair id=&quot;controller-0-proxy-role&quot; name=&quot;proxy-role&quot; value=&quot;true&quot;/&amp;gt;
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These attributes can then be referenced in &lt;a href=&quot;http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html&quot;&gt;location
constraints&lt;/a&gt;
that restrict the resource to a subset of the available nodes based on
&lt;a href=&quot;http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_using_rules_to_determine_resource_location.html#_location_rules_based_on_other_node_properties&quot;&gt;certain
criteria&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, we would use the following for HAProxy:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs constraint location haproxy-clone rule score=0 proxy-role eq true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which would create the following under the covers:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;rsc_location id=&quot;location-haproxy&quot; rsc=&quot;haproxy-clone&quot;&amp;gt;
  &amp;lt;rule id=&quot;location-haproxy-rule&quot; score=&quot;0&quot;&amp;gt;
    &amp;lt;expression id=&quot;location-haproxy-rule-expr&quot; attribute=&quot;proxy-role&quot; operation=&quot;eq&quot; value=&quot;true&quot;/&amp;gt;
  &amp;lt;/rule&amp;gt;
&amp;lt;/rsc_location&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Any node, native or remote, not meeting the criteria is automatically
eliminated as a possible host for the service.&lt;/p&gt;

&lt;p&gt;Pacemaker also defines some node attributes automatically based on a
node’s name and type.  These are also available for use in
constraints.  This allows us, for example, to force a resource such as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova-evacuate&lt;/code&gt; to run on a “real” cluster node with the command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs constraint location nova-evacuate rule score=0 &quot;#kind&quot; ne remote
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
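
&lt;p&gt;As with the HAProxy example, under the covers this becomes a rule-based
location constraint; the result would look roughly like this (the ids are
whatever &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcs&lt;/code&gt; generates):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;rsc_location id=&quot;location-nova-evacuate&quot; rsc=&quot;nova-evacuate&quot;&amp;gt;
  &amp;lt;rule id=&quot;location-nova-evacuate-rule&quot; score=&quot;0&quot;&amp;gt;
    &amp;lt;expression id=&quot;location-nova-evacuate-rule-expr&quot; attribute=&quot;#kind&quot; operation=&quot;ne&quot; value=&quot;remote&quot;/&amp;gt;
  &amp;lt;/rule&amp;gt;
&amp;lt;/rsc_location&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;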

&lt;p&gt;For deployments based on Pacemaker 1.1.15 or later, we can also
simplify the configuration by using pattern matching in our
constraints.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Restricting all the VIPs to nodes with the proxy role:&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &amp;lt;rsc_location id=&quot;location-haproxy-ips&quot; resource-discovery=&quot;exclusive&quot; rsc-pattern=&quot;^(ip-.*)&quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Restricting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova-compute&lt;/code&gt; to compute nodes (assuming a
standardized naming convention is used):&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &amp;lt;rsc_location id=&quot;location-nova-compute-clone&quot; resource-discovery=&quot;exclusive&quot; rsc-pattern=&quot;nova-compute-(.*)&quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;final-result&quot;&gt;Final Result&lt;/h1&gt;

&lt;p&gt;This is what a fully active cluster would look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;9 nodes configured
87 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
RemoteOnline: [ overcloud-compute-0 overcloud-compute-1
overcloud-compute-2 rabbitmq-extra-0 storage-0 storage-1 ]

 ip-172.16.3.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-192.0.2.17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-172.16.2.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.16.2.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.1.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 ip-192.0.3.30 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Slaves: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 rabbitmq-extra-0 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started storage-0
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: nova-compute-wait-clone [nova-compute-wait]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
 fence-nova (stonith:fence_compute): Started overcloud-controller-0
 storage-0 (ocf::pacemaker:remote): Started overcloud-controller-1
 storage-1 (ocf::pacemaker:remote): Started overcloud-controller-2
 overcloud-compute-0 (ocf::pacemaker:remote): Started overcloud-controller-0
 overcloud-compute-1 (ocf::pacemaker:remote): Started overcloud-controller-1
 overcloud-compute-2 (ocf::pacemaker:remote): Started overcloud-controller-2
 rabbitmq-extra-0 (ocf::pacemaker:remote): Started overcloud-controller-0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A small wish, but it would be nice if installers used meaningful names
for the VIPs instead of the underlying IP addresses they manage.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;One reason they may not do so on day one, is the careful co-ordination that some services can require when there is no overlap between the old and new sets of nodes assigned to a given role. Galera is one specific case that comes to mind. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="cluster" />
      
        <category term="architecture" />
      
        <category term="openstack" />
      

      
        <summary>Composable roles are a hot topic, I present a proposal for how to accommodate cluster-managed services.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Thoughts on HA for Multi-Subnet Deployments of OpenStack</title>
      <link href="https://www.beekhof.net/blog/2016/multi-subnet-ha-openstack" rel="alternate" type="text/html" title="Thoughts on HA for Multi-Subnet Deployments of OpenStack" />
      <published>2016-07-24T13:20:00+10:00</published>
      
        <updated>2016-07-24T13:20:00+10:00</updated>
      

      <id>https://www.beekhof.net/blog/2016/multi-subnet-ha-openstack</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2016/multi-subnet-ha-openstack">&lt;p&gt;In a normal deployment, in order to direct traffic to the same HAProxy
instance, Pacemaker will ensure that each VIP is configured on at most
one HAProxy machine.&lt;/p&gt;

&lt;p&gt;However in a spine and leaf network architecture, the nodes are in
multiple subnets and there may be a limitation that the machines
cannot be part of a common L3 network that the VIPs could be added to.&lt;/p&gt;

&lt;p&gt;Once the traffic reaches HAProxy, everything should JustWork(tm) -
modulo creating the appropriate networking rules.  The problem is
getting it to the proxy.&lt;/p&gt;

&lt;p&gt;The approach to dealing with this will need to differ based on the
latencies that can be guaranteed between every node in the cluster.
At Red Hat, we define LAN-like latencies to be 2ms or better -
consistently and between every node that would make up the cluster.&lt;/p&gt;

&lt;h1 id=&quot;low-latency-links&quot;&gt;Low Latency Links&lt;/h1&gt;

&lt;p&gt;You have more flexibility in low latency scenarios as the cluster
software can operate as designed.&lt;/p&gt;

&lt;p&gt;At a high level, the possible ways forward are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Decide the benefit isn’t worth it and create an L3 network just for
the VIPs to live on.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Put all the controllers into a single subnet.&lt;/p&gt;

    &lt;p&gt;Just be mindful of what will happen if the switch connecting them
goes bad.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Replace the HAProxy/VIP portion of the architecture with a load
balancer appliance that is accessible from the child subnets.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Move the HAProxy/VIP portion of the architecture into a dedicated
3-node cluster load balancer that is accessible from the child
subnets.&lt;/p&gt;

    &lt;p&gt;The new cluster would need the list of controllers and some health
checks which could be as simple as “is the controller up/down” or
as complex as “is service X up/down”.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Right now creating a load balancer near the network spine would have
to be an extra manual step for users of TripleO. However once
composable roles (the ability to mix and match which services go on
which nodes) are supported, it should be possible to install such a
configuration out of the box by placing three machines near the spine
and giving only them the “haproxy” role.&lt;/p&gt;

&lt;h1 id=&quot;higher-latency&quot;&gt;Higher Latency&lt;/h1&gt;

&lt;p&gt;Corosync has very strict latency requirements of no more than 2ms for
any of its links.  Assuming your installer can deploy across subnets,
the existence of such a link would be a barrier to the creation of a
highly available deployment.&lt;/p&gt;

&lt;p&gt;To work around these requirements, we can use Pacemaker Remote to
extend Pacemaker’s ability to manage services on nodes separated by
higher latency links.&lt;/p&gt;

&lt;p&gt;In TripleO, the work needed to make this style of deployment possible
is already planned as part of our “HA for composable roles” design.&lt;/p&gt;

&lt;p&gt;As per option 4 of the low latency case, such a deployment would
consist of a three node cluster containing only HAProxy and some
floating IPs.&lt;/p&gt;

&lt;p&gt;The rest of the nodes that make up the traditional OpenStack
control-plane are managed as remote cluster nodes, meaning that instead of
a traditional Corosync and Pacemaker stack, they have only the
pacemaker-remote daemon and do not participate in leader elections or
quorum calculations.&lt;/p&gt;
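
&lt;p&gt;For context, turning a machine into a remote node is itself just a
resource definition on the cluster; a minimal sketch (the node name is
illustrative, and assumes the machine is already running the
pacemaker-remote daemon with a shared authkey):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pcs resource create overcloud-compute-0 ocf:pacemaker:remote
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;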

&lt;h2 id=&quot;external-load-balancers&quot;&gt;External Load Balancers&lt;/h2&gt;

&lt;p&gt;If you wish to use a dedicated load balancer, then the 3-node
cluster would just co-ordinate the actions of the remote nodes and not
host any services locally.&lt;/p&gt;

&lt;p&gt;An installer may conceivably create them anyway but leave them
disabled to simplify the testing matrix.&lt;/p&gt;

&lt;h1 id=&quot;general-considerations&quot;&gt;General Considerations&lt;/h1&gt;

&lt;p&gt;The creation of a separate subnet or set of subnets for fencing is
highly encouraged.&lt;/p&gt;

&lt;p&gt;In general we want to avoid the possibility of a single network(ing)
failure taking out communication to both a set of nodes and
the device that can turn them off.&lt;/p&gt;

&lt;p&gt;Everything in HA is a trade-off between the chance of a particular
failure occurring and the consequences if it ever actually happens.
Everyone will likely draw the line in a different place based on their
risk aversion, all I can do is make recommendations based on my
background in this field.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="cluster" />
      
        <category term="architecture" />
      
        <category term="openstack" />
      

      
        <summary>Installing OpenStack in a spine-and-leaf network presents problems for high availability, these are some thoughts on what do to about it.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Working with OpenStack Images</title>
      <link href="https://www.beekhof.net/blog/2016/working-with-images" rel="alternate" type="text/html" title="Working with OpenStack Images" />
      <published>2016-06-20T12:45:00+10:00</published>
      
        <updated>2016-06-20T12:45:00+10:00</updated>
      

      <id>https://www.beekhof.net/blog/2016/working-with-images</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2016/working-with-images">&lt;h2 id=&quot;creating-images&quot;&gt;Creating Images&lt;/h2&gt;

&lt;p&gt;For creating images, I recommend the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-builder&lt;/code&gt; tool that ships
with RHEL based distributions and possibly others:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;virt-builder centos-7.2 --format qcow2 --install &quot;cloud-init&quot; --selinux-relabel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note the use of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--selinux-relabel&lt;/code&gt; option.  If you specify
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--install&lt;/code&gt; but do not include this option, you may end up with
instances that treat all attempts to log in as security violations
and block them.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-init&lt;/code&gt; package is incredibly useful (discussed later) but
isn’t available in CentOS images by default, so I recommend adding it
to any image you create.&lt;/p&gt;

&lt;p&gt;For the full list of supported targets, try &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-builder -l&lt;/code&gt;.
Targets should include CirrOS as well as several versions of openSUSE,
Fedora, CentOS, Debian, and Ubuntu.&lt;/p&gt;

&lt;h2 id=&quot;adding-packages-to-an-existing-image&quot;&gt;Adding Packages to an existing Image&lt;/h2&gt;

&lt;p&gt;On RHEL based distributions, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-customize&lt;/code&gt; tool is available
and makes adding a new package to an existing image simple.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;virt-customize  -v -a myImage --install &quot;wget,ntp&quot; --selinux-relabel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note once again the use of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--selinux-relabel&lt;/code&gt; option.  This
should only be used for the last step of your customization.  As
above, not doing so may result in an instance that treats all attempts
to log in as security violations and blocks them.&lt;/p&gt;

&lt;p&gt;Richard Jones also has a good post about &lt;a href=&quot;https://rwmj.wordpress.com/2015/10/03/tip-updating-rhel-7-1-cloud-images-using-virt-customize-and-subscription-manager/&quot;&gt;updating RHEL
images&lt;/a&gt;
since they require subscriptions. Just be sure to use
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--sm-unregister&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--selinux-relabel&lt;/code&gt; at the very end.&lt;/p&gt;

&lt;h2 id=&quot;logging-in&quot;&gt;Logging in&lt;/h2&gt;

&lt;p&gt;If you haven’t already, tell OpenStack about your keypair:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nova keypair-add myKey --pub-key ~/.ssh/id_rsa.pub
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now you can tell your provisioning tool to add it to the instances it
creates.  For &lt;a href=&quot;https://wiki.openstack.org/wiki/Heat&quot;&gt;Heat&lt;/a&gt;, the
template would look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;myInstance:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    key_name: myKey
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However almost no image will let you log in, via ssh or on the console, as
root.  Instead they normally create a new user that has full &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo&lt;/code&gt;
access.  Red Hat images default to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-user&lt;/code&gt; while CentOS has a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;centos&lt;/code&gt; user.&lt;/p&gt;

&lt;p&gt;If you don’t already know which user your instance has, you can use
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova console-log myInstance&lt;/code&gt; to see what happens at boot time.&lt;/p&gt;

&lt;p&gt;Assuming you configured a key to add to the instance, you might see a
line such as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ci-info: ++++++Authorized keys from /home/cloud-user/.ssh/authorized_keys for user cloud-user+++++++
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which tells you which user your image supports.&lt;/p&gt;
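
&lt;p&gt;A quick way to fish that line out of a long log (the instance name
here matches the Heat snippet above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nova console-log myInstance | grep &quot;Authorized keys&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;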

&lt;h2 id=&quot;customizing-an-instance-at-boot-time&quot;&gt;Customizing an Instance at Boot Time&lt;/h2&gt;

&lt;p&gt;This section relies heavily on the
&lt;a href=&quot;https://launchpad.net/cloud-init&quot;&gt;cloud-init&lt;/a&gt; package.  If it is not
present in your images, be sure to add it using the techniques above
before trying anything below.&lt;/p&gt;

&lt;h3 id=&quot;running-scripts&quot;&gt;Running Scripts&lt;/h3&gt;

&lt;p&gt;Running scripts on the instances once they’re up can be a useful way
to customize your images, start services, and generally work around
bugs in officially provided images.&lt;/p&gt;

&lt;p&gt;The list of commands to run is specified as part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_data&lt;/code&gt;
section of a &lt;a href=&quot;https://wiki.openstack.org/wiki/Heat&quot;&gt;Heat&lt;/a&gt; template or
can be passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova boot&lt;/code&gt; with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user-data&lt;/code&gt; option:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      #!/bin/sh -ex

      # Fix broken qemu/strstr()
      # https://bugzilla.redhat.com/show_bug.cgi?id=1269529#c9
      touch /etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note the extra options passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/bin/sh&lt;/code&gt;.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;e&lt;/code&gt; tells the shell to
terminate if any command produces an error, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; tells it
to log everything that is being executed.  This is particularly useful
as it causes the script’s execution to be available in the console’s
log (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova console-log myServer&lt;/code&gt;).&lt;/p&gt;

&lt;h3 id=&quot;when-scripts-take-a-really-long-time&quot;&gt;When Scripts Take a Really Long Time&lt;/h3&gt;

&lt;p&gt;If we have scripts that take a really long time, we may want to delay
the creation of subsequent resources until our instance is fully
configured.&lt;/p&gt;

&lt;p&gt;If we are using &lt;a href=&quot;https://wiki.openstack.org/wiki/Heat&quot;&gt;Heat&lt;/a&gt;, we can
set this up by creating SwiftSignal and SwiftSignalHandle resources to
coordinate resource creation with notifications/signals that could be
coming from sources external or internal to the stack.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;signal_handle:
  type: OS::Heat::SwiftSignalHandle

wait_on_server:
  type: OS::Heat::SwiftSignal
  properties:
    handle: {get_resource: signal_handle}
    count: 1
    timeout: 2000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We then add a layer of indirection to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_data:&lt;/code&gt; portion of the
instance definition using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;str_replace:&lt;/code&gt; function to replace all
occurrences of “wc_notify” in the script with an appropriate curl PUT
request using the “curl_cli” attribute of the SwiftSignalHandle
resource.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      str_replace:
        params:
          wc_notify:   { get_attr: [&apos;signal_handle&apos;, &apos;curl_cli&apos;] }
        template: |
          #!/bin/sh -ex

          my_command_that --takes-a-really long-time

          wc_notify --data-binary &apos;{&quot;status&quot;: &quot;SUCCESS&quot;, &quot;data&quot;: &quot;Script execution succeeded&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now the creation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;myNode&lt;/code&gt; will only be considered successful if and
when the script completes.&lt;/p&gt;

&lt;h3 id=&quot;installing-packages&quot;&gt;Installing Packages&lt;/h3&gt;

&lt;p&gt;One should avoid the temptation to hardcode calls to a specific
package manager as part of a script, as doing so limits the usefulness of
your template.  Instead, packages can be installed in a platform-agnostic way using
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packages&lt;/code&gt; directive.&lt;/p&gt;

&lt;p&gt;Note that instance creation will not fail if packages fail to install
or are already present.  Check for any required binaries or files as
part of the script.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;user_data_format: RAW
user_data:
  #cloud-config
  # See http://cloudinit.readthedocs.io/en/latest/topics/examples.html
  packages:
    - ntp
    - wget
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that this will NOT work for images that need a Red Hat
subscription.  There is supposed to be a way to have the instance register
itself; however, I’ve had &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=1340323&quot;&gt;no
success&lt;/a&gt; with
this method, and instead I recommend you create a new image that has
any packages listed here pre-installed.&lt;/p&gt;

&lt;h3 id=&quot;installing-packages-and-running-scripts&quot;&gt;Installing Packages &lt;em&gt;and&lt;/em&gt; Running scripts&lt;/h3&gt;

&lt;p&gt;The first line of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_data:&lt;/code&gt; section (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#cloud-config&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#!/bin/sh&lt;/code&gt;)
is used to determine how it should be interpreted. So if we wish to
take advantage of scripting and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-init&lt;/code&gt;, we must combine the two
pieces into a multi-part MIME message.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-init&lt;/code&gt; docs include a &lt;a href=&quot;http://cloudinit.readthedocs.io/en/latest/topics/format.html#helper-script-to-generate-mime-messages&quot;&gt;MIME helper
script&lt;/a&gt;
to assist in the creation of complex &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_data:&lt;/code&gt; blocks.&lt;/p&gt;

&lt;p&gt;Simply create a file for each section and invoke with a command line similar to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python ./mime.py cloud.config:text/cloud-config cloud.sh:text/x-shellscript
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting output can then be pasted in as a template and even
edited in-place later.  Here is an example that includes notification
for a long running process:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;user_data_format: RAW
user_data:
  str_replace:
    params:
      wc_notify:   { get_attr: [&apos;signal_handle&apos;, &apos;curl_cli&apos;] }
    template: |
      Content-Type: multipart/mixed; boundary=&quot;===============3343034662225461311==&quot;
      MIME-Version: 1.0
      
      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/cloud-config; charset=&quot;us-ascii&quot;
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename=&quot;cloud.config&quot;

      #cloud-config
      packages:
        - ntp
        - wget

      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/x-shellscript; charset=&quot;us-ascii&quot;
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename=&quot;cloud.sh&quot;
      
      #!/bin/sh -ex

      my_command_that --takes-a-really long-time

      wc_notify --data-binary &apos;{&quot;status&quot;: &quot;SUCCESS&quot;, &quot;data&quot;: &quot;Script execution succeeded&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="heat" />
      
        <category term="openstack" />
      

      
        <summary>Some tips for creating and extending OpenStack images</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      <title>Evolving the OpenStack HA Architecture</title>
      <link href="https://www.beekhof.net/blog/2016/next-openstack-ha-arch" rel="alternate" type="text/html" title="Evolving the OpenStack HA Architecture" />
      <published>2016-06-07T14:06:00+10:00</published>
      
        <updated>2016-06-07T14:06:00+10:00</updated>
      

      <id>https://www.beekhof.net/blog/2016/next-openstack-ha-arch</id>
      <content type="html" xml:base="https://www.beekhof.net/blog/2016/next-openstack-ha-arch">&lt;p&gt;In the current OpenStack HA architecture used by Red Hat, SuSE and
others, Systemd is the entity in charge of starting and stopping most
OpenStack services.  Pacemaker exists as a layer on top, signalling
when this should happen, but Systemd is the part making it happen.&lt;/p&gt;

&lt;p&gt;This is a valuable arrangement for active/passive (A/P) services and
those that require all their dependencies to be available during their
startup and shutdown sequences.  However, as OpenStack matures, more
and more components are able to operate in an unconstrained
active/active (A/A) capacity with little regard for the startup/shutdown
order of their peers or dependencies - making them well suited to being
managed by Systemd alone.&lt;/p&gt;

&lt;p&gt;For this reason, a future revision of the HA architecture should limit
Pacemaker’s involvement to core services like Galera and Rabbit as
well as the few remaining OpenStack services that run A/P.&lt;/p&gt;

&lt;p&gt;This would be particularly useful as we look towards a containerised
future.  It allows OpenStack to play nicely with the current
generation of container managers, which lack orchestration, and it
reduces recovery time and downtime by allowing for maximum parallelism.&lt;/p&gt;

&lt;p&gt;Divesting most OpenStack services from the cluster also removes
Pacemaker as a potential obstacle for moving them to WSGI.  It is
as yet unclear whether services will live under a single Apache instance
or many, and the former would conflict with Pacemaker’s model of
starting, stopping and monitoring services as individual components.&lt;/p&gt;

&lt;h2 id=&quot;objection-1---pacemaker-as-an-alerting-mechanism&quot;&gt;Objection 1 - Pacemaker as an Alerting Mechanism&lt;/h2&gt;

&lt;p&gt;Using Pacemaker as an alerting mechanism for a large software stack is
of limited value.  Of course Pacemaker needs to know when a service
dies, but it necessarily takes action straight away rather than waiting
to see if there will be any other failures with which it can correlate
a root cause.&lt;/p&gt;

&lt;p&gt;In large, complex software stacks, the recovery and alerting components
should not be the same thing, because they do - and should - operate on
different timescales.&lt;/p&gt;

&lt;p&gt;Pacemaker also has no way to include the context of a failure in an
alert and thus no way to report the difference between Nova failing
and Nova failing because Keystone is dead.  Indeed Keystone being the
root cause could be easily lost in a deluge of notifications about the
failure of services that depend on it.&lt;/p&gt;

&lt;p&gt;For this reason, as the number of services and dependencies grows,
Pacemaker makes a poor substitute for a well-configured monitoring and
alerting system (such as Nagios or Sensu) that can also integrate
hardware and network metrics.&lt;/p&gt;

&lt;h2 id=&quot;objection-2---pacemaker-has-better-monitoring&quot;&gt;Objection 2 - Pacemaker has better Monitoring&lt;/h2&gt;

&lt;p&gt;Pacemaker’s native ability to monitor services is more flexible than
Systemd’s which relies on a “PID up == service healthy” mode of
thinking &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;However, just as Systemd is the entity performing the startup and
shutdown of most OpenStack services, it is also the one performing the
actual service health checks.&lt;/p&gt;
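
&lt;p&gt;As a sketch of what that looks like (the unit below is illustrative
only - real units vary by service and distribution), Systemd’s recovery
behaviour for an OpenStack service amounts to restarting the process
when it exits:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[Unit]
Description=OpenStack Nova API (illustrative sketch only)
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/nova-api
# &quot;PID up == service healthy&quot;: restart whenever the process exits
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;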

&lt;p&gt;To actually take advantage of Pacemaker’s monitoring capabilities, you
would need to write Open Cluster Framework (OCF) agents &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; for every
OpenStack service. While this would not take a rocket scientist to
achieve, it creates an opportunity for the way services are started in
clustered and non-clustered environments to diverge.&lt;/p&gt;
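
&lt;p&gt;For illustration, an OCF agent is little more than a shell script
mapping &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;stop&lt;/code&gt; and &lt;code&gt;monitor&lt;/code&gt;
actions onto service-specific commands and OCF exit codes.  The service
name and health URL below are hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/sh
# Hypothetical OCF agent sketch - not a drop-in for any real service.
. ${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs

case &quot;$1&quot; in
  start)
    systemctl start my-openstack-service ;;
  stop)
    systemctl stop my-openstack-service ;;
  monitor)
    # Unlike &quot;PID up == healthy&quot;, probe the API the service exposes
    if curl -sf http://localhost:8774/healthcheck &gt;/dev/null; then
      exit $OCF_SUCCESS
    fi
    exit $OCF_NOT_RUNNING ;;
  *)
    exit $OCF_ERR_UNIMPLEMENTED ;;
esac
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;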

&lt;p&gt;So while it may feel good to look at a cluster and see that Pacemaker
is configured to check the health of a service every N seconds, all
that really achieves is to sync Pacemaker’s understanding of the
service with what Systemd already knew.  In practice, on average, this
ends up delaying recovery by N/2 seconds instead of making it faster.&lt;/p&gt;

&lt;h2 id=&quot;bonus-round---activepassive-ftw&quot;&gt;Bonus Round - Active/Passive FTW&lt;/h2&gt;

&lt;p&gt;Some people have the impression that A/P is a better or simpler mode
of operation for services, and in this way justify the continued use of
Pacemaker to manage OpenStack services.&lt;/p&gt;

&lt;p&gt;Support for A/P configurations is important: it allows us to make
applications that are in no way cluster-aware more available by
reducing the requirements on the application to almost zero.&lt;/p&gt;

&lt;p&gt;However, the downside is slower recovery, as the service must be
bootstrapped on the passive node, which implies increased downtime.
So at the point a service becomes smart enough to run in an
unconstrained A/A configuration, you are better off doing so - with or
without a cluster manager.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Watchdog-like functionality is only a variation on this: it tells you only that the thread responsible for heartbeating to Systemd is alive and well, not whether the APIs the service exposes are functioning. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Think SYS-V init scripts with some extra capabilities and requirements particular to clustered/automated environments.  It is a standard historically supported by the Linux Foundation, but it hasn’t caught on much since it was created in the late ’90s. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content>

      
      
      
      
      

      
        <author>
            <name>Andrew Beekhof</name>
          
            <email>andrew@beekhof.net</email>
          
          
        </author>
      

      

      
        <category term="cluster" />
      
        <category term="architecture" />
      
        <category term="openstack" />
      

      
        <summary>A future revision of the HA architecture should limit Pacemaker involvement to services like Galera, Rabbit and the few remaining OpenStack services that can only run active/passive.</summary>
      

      
      
    </entry>
  
  
</feed>
