It’s a common question and a worthy topic for an extended article. Here are the steps I usually follow when diagnosing such issues.

Is the cluster allowed to start services?

  1. Check quorum status with crm_mon --one-shot

    Quorum is a property of the cluster that is attained when more than half of the known nodes are online. Unlike Heartbeat, OpenAIS-based clusters don’t pretend to have quorum when only one of a possible two nodes is available. In such situations, the cluster’s default behavior is to ensure data integrity by stopping all services.

    Check the current value of the no-quorum-policy option:

     crm_attribute -n no-quorum-policy -G
    

    If you don’t have quorum, you can tell the cluster to ignore the loss of quorum and start resources anyway:

     crm_attribute -n no-quorum-policy -v ignore
    

    Be careful to ensure STONITH is correctly configured before using the ignore option.
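
    One quick sanity check is to read the stonith-enabled cluster property, in the same way as no-quorum-policy above; a value of true only tells you fencing is switched on, not that the fencing devices themselves have been tested:

     crm_attribute -n stonith-enabled -G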

  2. Check if the cluster is managing services:

    Check the global default

     crm_attribute --type rsc_defaults -n is-managed -G
    

    Check the per-resource values

     cibadmin --query --xpath //nvpair[@name="is-managed"]
    

    Check the old location for the global default

     crm_attribute -n is-managed-default -G
    

    Look for any results indicating a value of false
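
    If any of them do come back as false, the corresponding meta attribute can be set back to true. A sketch for a single resource using crm_resource, where my-resource stands in for the real resource id:

     crm_resource --resource my-resource --meta --set-parameter is-managed --parameter-value true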

  3. Check target-role

    The target-role setting controls what state the resource can achieve. The list of possible states is:

    • Stopped
    • Started
    • Slave
    • Master

    Look out for any places indicating a value of Stopped. In the case of master/slave resources that aren’t being promoted, a value of Started can also be problematic.

    Check the global default

     crm_attribute --type rsc_defaults -n target-role -G
    

    Check the per-resource values

     cibadmin --query --xpath //nvpair[@name="target-role"]
    
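    If a resource turns out to be pinned to Stopped, deleting the meta attribute lets the default of Started apply again. A sketch, again with a placeholder resource id:

     crm_resource --resource my-resource --meta --delete-parameter target-role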

Look for failures

  1. You can see the list of failures in the crm_mon output:

     crm_mon --one-shot --operations
    
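    If your version of crm_mon supports it, adding --failcounts to the same call also prints how often each resource has failed on each node, which makes it easier to spot repeat offenders:

     crm_mon --one-shot --operations --failcounts
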
  2. Another good source of information is ptest, which can simulate what the cluster would try to do.

     ptest --live-check -VVV
    

    Look for anything unusual in the output such as

    WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on nagios-clu2: unknown error
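
    To cut the noise down, the verbose output can be filtered for just the warnings and errors; the 2>&1 merges stderr into the pipe so the log lines are not missed:

     ptest --live-check -VVV 2>&1 | grep -e "WARN:" -e "ERROR:"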

  3. Check the logs

     ssh -l root nagios-clu2 -- grep drbd0:1 /var/log/messages
    

Cleaning up after failures

If you identified any failures above, you can instruct the cluster to “forget” about them:

	crm_resource --cleanup --node nagios-clu2

This results in the resource history being erased on nagios-clu2. The cluster will then attempt to start any services that were not already active.
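
To limit the cleanup to a single resource instead of erasing the node’s entire resource history, you can also name the resource; the id below is simply taken from the failure example above:

     crm_resource --cleanup --resource drbd0:1 --node nagios-clu2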

NOTE: This will have little or no benefit if the underlying issue, the one that caused the resource to fail in the first place, has not been fixed. If the problem persists, the resource will simply return to a failed state and the cluster will still refuse to start it.

In a later article, I’ll explain how the cluster can recover from transient failures automatically by timing them out after a certain interval.
