Savaged by Softdog, a Cautionary Tale
Hardware is imperfect, and software contains bugs. Don’t use software-based watchdogs and expect to survive the latter.
Baseline: without probes 28s
Use hashtables instead of lists to store the available nodes for a resource
New time without probes: 18s
Defer creation of deletion, promote and demote constraints until they are needed
New time without probes: 13s
Use g_list_prepend() instead of g_list_append() for the list of ordering constraints
New time without probes: 5s
New algorithm for determining which clone instances need probing
New time with probes: 31s
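To illustrate why swapping g_list_append() for g_list_prepend() can matter so much, here is a minimal sketch. It does not use GLib: node_t, list_append(), list_prepend(), list_reverse() and lists_equal() are hypothetical stand-ins written for this example, and the steps counter exists only to make the cost visible. The assumption is the usual one for a list without a tail pointer: appending has to walk to the end first, while prepending does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a list node; this is NOT the GLib GList type. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Append walks to the tail on every call, so one call is O(n) and
 * building an n-element list costs O(n^2) link traversals in total.
 * 'steps' counts those traversals for illustration only. */
static node_t *list_append(node_t *head, int value, size_t *steps)
{
    node_t *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->value = value;
    n->next = NULL;
    if (head == NULL) {
        return n;
    }
    node_t *tail = head;
    while (tail->next != NULL) {
        tail = tail->next;
        (*steps)++;
    }
    tail->next = n;
    return head;
}

/* Prepend never looks past the head, so every call is O(1). */
static node_t *list_prepend(node_t *head, int value)
{
    node_t *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->value = value;
    n->next = head;
    return n;
}

/* One O(n) pass restores insertion order, if the order matters. */
static node_t *list_reverse(node_t *head)
{
    node_t *prev = NULL;
    while (head != NULL) {
        node_t *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

/* Returns 1 if both lists hold the same values in the same order. */
static int lists_equal(const node_t *a, const node_t *b)
{
    while (a != NULL && b != NULL) {
        if (a->value != b->value) {
            return 0;
        }
        a = a->next;
        b = b->next;
    }
    return a == NULL && b == NULL;
}

static void list_free(node_t *head)
{
    while (head != NULL) {
        node_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Building a 1000-element list with list_append() in this sketch takes roughly half a million link traversals, while 1000 calls to list_prepend() followed by a single list_reverse() touch each node only a constant number of times. Whether the reverse pass is needed at all depends on whether the consumer cares about order, which these notes do not say for the ordering constraints.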
The CIB was harder to profile. Rather than giving it one large task to chew through and timing it with a few printf calls for granularity, I had to run it through a profiler while it was operating in a real cluster and see where most of the time was being spent.
Remove most uses of cib_msg_copy() to reduce the amount of needless copying.
Phase speedup: 10%
Compression costs a LOT, so don’t do it unless we’re hitting message limits. For now, use 256k as the threshold at which compression kicks in. The previous limit was 10k; compressing 184 of 1071 messages accounted for 23% of the total CPU used by the CIB.
Each time we validated the CIB, we were re-reading and re-parsing the RelaxNG schema, which accounted for 28% of the CIB’s CPU usage on the DC. We now read it once and cache the result for the life of the CIB process.
Phase speedup: 51%
Push detection of group and set ordering changes to (the less busy) slave instances. This detection was costing 15% of the CIB’s total CPU time on the DC.
Phase speedup: 15%
The majority of the CPU time spent by the CIB is in post-processing.
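The read-once schema caching described above can be sketched roughly as follows. This is an illustration of the general pattern, not the CIB's actual code: schema_t, parse_schema(), get_schema(), validate() and free_schema_cache() are all hypothetical names, parsing is reduced to copying a string, and parse_count exists only to show how often the expensive step runs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed RelaxNG schema. */
typedef struct {
    char *text;
} schema_t;

static schema_t *cached_schema = NULL; /* lives for the life of the process */
static unsigned parse_count = 0;       /* illustration only */

/* Stands in for the expensive read-and-parse step. */
static schema_t *parse_schema(const char *source)
{
    schema_t *s = malloc(sizeof(*s));
    assert(s != NULL);
    size_t len = strlen(source) + 1;
    s->text = malloc(len);
    assert(s->text != NULL);
    memcpy(s->text, source, len);
    parse_count++;
    return s;
}

/* Parse on first use, then hand back the cached result every time. */
static const schema_t *get_schema(const char *source)
{
    if (cached_schema == NULL) {
        cached_schema = parse_schema(source);
    }
    return cached_schema;
}

/* Placeholder validation: a real check would walk the document
 * against the schema. Returns 1 on success, 0 on failure. */
static int validate(const char *doc, const char *schema_source)
{
    const schema_t *s = get_schema(schema_source);
    return doc != NULL && s->text != NULL;
}

/* Release the cache, for example at process shutdown. */
static void free_schema_cache(void)
{
    if (cached_schema != NULL) {
        free(cached_schema->text);
        free(cached_schema);
        cached_schema = NULL;
    }
}
```

However many times validate() is called in this sketch, parse_schema() runs once. The trade-off is that a schema changed on disk is not picked up until the process restarts, which matches the "for the life of the CIB process" wording above.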