Savaged by Softdog, a Cautionary Tale
Hardware is imperfect, and software contains bugs. Don’t use software-based watchdogs and expect to survive the latter.
It takes time for nova to accept the evacuation calls, so there is a window in which some can be lost if this node dies too.
Hopefully I’ve demonstrated the difficulty of adding pets (highly available instances) as an afterthought. All it took to derail the efforts here was the seemingly innocuous decision that the admin should be responsible for retrying failed evacuations (based on an instance not having appeared somewhere else after a while?). Who knows what similar assumptions are still lurking.
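To make the retry problem concrete, here is a minimal sketch of what "keep retrying until the evacuation is accepted" might look like. It is an illustration under stated assumptions, not how nova or any existing tool does it: `evacuate_with_retry` and the `request_evacuation` callable are hypothetical names, and the callable stands in for whatever call actually asks the cloud to evacuate one instance.

```python
import time
from typing import Callable, Dict, Iterable


def evacuate_with_retry(
    instances: Iterable[str],
    request_evacuation: Callable[[str], bool],
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> Dict[str, bool]:
    """Re-request evacuation until each instance is accepted or attempts run out.

    request_evacuation is a hypothetical stand-in for the real call; it
    should return True once the request has been accepted.
    """
    pending = list(instances)
    accepted = {name: False for name in pending}

    for attempt in range(max_attempts):
        still_pending = []
        for name in pending:
            try:
                ok = request_evacuation(name)
            except Exception:
                # Treat any failure as "not accepted yet" and retry later.
                ok = False
            if ok:
                accepted[name] = True
            else:
                still_pending.append(name)

        pending = still_pending
        if not pending:
            break
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)

    return accepted
```

Note that the list of pending instances lives only in the memory of the process running the loop. If that node dies part way through, the remaining evacuations are forgotten, which is exactly the window described above.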
At this point, people are probably expecting that I put my Pacemaker hat on and advocate for it to be given responsibility for all the pets. Sure, we could do it: we could use nova APIs to manage them, just like we do when people use their hypervisors directly.
But that’s never going to happen, so let’s look at the alternatives. I foresee three main options:
1. First-class support for pets in nova
Seriously, the scheduler is the best place for all this: it has all the info to make decisions and the ability to make them happen.
2. First-class support for pets in something that replaces nova
If the technical debt or political situation is such that nova cannot move in this direction, perhaps someone else might.
3. Creation of a distributed finite state machine that:
The cluster community has pretty much all the tech needed for the last option, but it is mostly written in C so I expect that someone will replicate it all in Python or Go :-)
If anyone is interested in pursuing capabilities in this area and would like to benefit from the knowledge that comes with 14 years’ experience writing cluster managers, drop me a line.