Yesterday I was working on a migration bug.

It didn’t take long to identify or fix, and afterwards I was terribly pleased with myself.

The fix was simple and elegant, and it allowed the cluster to use migration (instead of stop-then-start) more often.

Why had I not seen how easy it was sooner?

Unfortunately it was because I’d ignored half the problem.

One decision I’m particularly happy I made, six years ago, is ensuring that Pacemaker’s Policy Engine could be used outside of a running cluster. Combined with some additional output options, this makes it possible to have a suite of regression tests (224 right at this moment) that catches this sort of idiocy before it ever affects an actual user.
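
For the curious, driving the Policy Engine standalone looks roughly like the sketch below. I’m using crm_simulate here (older releases shipped the same capability as ptest), and every file name is illustrative:

    # Feed a saved CIB to the Policy Engine with no cluster running and
    # record the transition it computes (file names are illustrative):
    #   -x: input CIB, -S: simulate, -G: save transition graph (XML),
    #   -D: save transition graph (dot)
    crm_simulate -x test-input.xml -S -G actual-graph.xml -D actual.dot

    # A regression test then boils down to comparing the computed
    # transition against a known-good copy stored next to the input.
    diff -u expected-graph.xml actual-graph.xml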

The Policy Engine is by far the most complex part of the system, and it’s totally infeasible to test by hand even a small fraction of what the regression suite can cover (and in under 30 seconds too!).

This is why I’m so confident when I say that each release is better than the last. Once a Policy Engine bug gets fixed, it stays fixed.

The benefits of offline testing also show up much earlier in the process. The Policy Engine keeps a rolling archive of the cluster states it performed calculations on, and our reporting tool collects these. So when users report a Policy Engine bug, there is no need to reproduce the issue, and afterwards we can conclusively show (using pretty dot graphs of the cluster’s old and new behavior) that the issue is resolved.
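
Concretely, replaying a reported input looks something like this sketch (the archive path and file number are illustrative; the Policy Engine saves its inputs as pe-input-*.bz2 files, which crm_simulate can read directly):

    # Replay the exact input that produced the questionable transition
    # (path and file number are illustrative).
    crm_simulate -x /var/lib/pacemaker/pengine/pe-input-42.bz2 -S -D old-behavior.dot

    # Rerun against the fixed Policy Engine, then render both graphs
    # for the bug report.
    crm_simulate -x /var/lib/pacemaker/pengine/pe-input-42.bz2 -S -D new-behavior.dot
    dot -Tsvg old-behavior.dot -o old-behavior.svg
    dot -Tsvg new-behavior.dot -o new-behavior.svg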

So I sat down again today and made sure I’d thought the whole problem through, so that the next version would be a complete solution. You can see my notes below if you’re interested.

And now I’m off to implement it (and some extra tests :-)

Migration Scenario Notes

Cluster Setup

primitive(A) depends on clone(B)
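
In crm shell syntax, this setup would look something like the sketch below (the Dummy agent and all IDs are illustrative; allow-migrate is what makes A a candidate for migration):

    # B is a clone; A is a migratable primitive that must run where an
    # instance of B is active (agent and IDs are illustrative).
    crm configure primitive B ocf:pacemaker:Dummy
    crm configure clone B-clone B
    crm configure primitive A ocf:pacemaker:Dummy meta allow-migrate=true
    crm configure colocation A-with-B inf: A B-clone
    crm configure order B-before-A inf: B-clone A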

Resource Activity During Move: A (node-1 to node-2)

time   node-1    node-2    node-3
t0     A.stop    -         -
t1     B.stop    -         B.stop
t2     -         B.start   B.start
t3     -         A.start   -


Resource Activity During Migration: A (node-1 to node-2)

time   node-1     node-2     node-3
t0     -          B.start    B.start
t1     A.stop*    -          -
t2     -          A.start**  -
t3     B.stop     -          B.stop


  • Note *: Rewritten to be a migrate-to operation
  • Note **: Rewritten to be a migrate-from operation
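
If you simulate such a transition, the rewrite is visible in the resulting graph: the operations appear as migrate_to/migrate_from actions instead of stop/start. A quick sketch (file names illustrative):

    # Look for the rewritten operations in the transition graph; they
    # show up with names like A_migrate_to_0 and A_migrate_from_0
    # (input file name is illustrative).
    crm_simulate -x migration-test.xml -S -D t.dot
    grep -E 'migrate_(to|from)' t.dot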

Constraints

The following constraints already exist in the system. The ‘ok’ and ‘fail’ annotations refer to whether each one still holds for migration; where the scenarios below talk about “reversing” a constraint, they mean flipping its direction for the migration case (e.g. constraint 5, B.stop -> A.start, would become A.start -> B.stop). There’s also a mechanical way to check these, sketched after the list.

  1. A.stop -> A.start - ok
  2. B.stop -> B.start - fail
  3. A.stop -> B.stop - ok
  4. B.start -> A.start - ok
  5. B.stop -> A.start - fail
  6. A.stop -> B.start - fail
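
To make “still holds” concrete: each constraint has to be compatible with the migration schedule above (B.start, A.stop, A.start, B.stop, in that order). The sketch below feeds that schedule plus constraint 5 to tsort; a reported loop means the constraint cannot be satisfied once migration reorders things:

    # The migration schedule as precedence pairs, plus constraint 5
    # (B.stop -> A.start). tsort reports a loop: the constraint no
    # longer holds under the migration ordering.
    { printf '%s\n' 'B.start A.stop' 'A.stop A.start' 'A.start B.stop'
      printf '%s\n' 'B.stop A.start'
    } | tsort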

Scenarios

  1. B unchanged - ok
  2. B stopping only - fail - possible after reversing constraint 5
  3. B starting only - fail - possible after reversing constraint 6***
  4. B stopping and starting - fail - constraint 2 is unfixable
  5. B restarting but only on node-2 - fail - as per case 4 but even less likely

Note ***: This is what the existing implementation does