Where to start
The first thing to do is look in syslog for the terms ERROR and WARN, e.g.
grep -e ERROR -e WARN /var/log/messages
If nothing looks appropriate, find the logs from the crmd process:
grep -e 'crmd\[' -e 'crmd:' /var/log/messages
If you do not see anything from the crmd, check that your cluster is configured to log to syslog, and check the syslog configuration to see where messages are being sent. If in doubt, refer to Pacemaker Logging for how to obtain more detail.
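As a minimal sketch, a corosync-based cluster can be pointed at syslog with a logging section like the following in /etc/corosync/corosync.conf (check corosync.conf(5) for the exact directives supported by your version):
logging {
    to_syslog: yes
    syslog_facility: daemon
}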
Although the crmd process is active on all cluster nodes, decisions are only made on one of them. The “DC” is chosen through the crmd's election process and may move around as nodes leave and join the cluster.
For node failures, you’ll always want the logs from the DC (or the node that becomes the DC).
For resource failures, you’ll want the logs from the DC and the node on which the resource failed.
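If the cluster is still running, crm_mon can tell you which node is currently the DC; the -1 option prints the status once and exits:
crm_mon -1 | grep "Current DC"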
Log entries like:
crmd[1811]: notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)
indicates a node is no longer part of the cluster (either because it failed or was shut down).
crmd[1811]: notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)
indicates a node has (re)joined the cluster.
crmd[1811]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]: notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...
indicates recovery was attempted.
crmd[1811]: notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]: notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4
indicates we performed a resource action; in this case, we are checking the status of the www resource on corosync-host-5 and starting a recurring health check for mysql on corosync-host-4. (The number at the end of an operation name is its interval in milliseconds; an interval of 0, as in www_monitor_0, denotes a one-off status probe.)
crmd[1811]: notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)
indicates that we are attempting to fence corosync-host-8.
crmd[1811]: notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK
indicates that corosync-host-1 successfully fenced corosync-host-8.
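To pull all of the entry types above out of syslog in one pass, grep for the function names (these are from the 1.1-era crmd shown in the examples; they may differ in other Pacemaker versions):
grep -e crm_update_peer_state -e do_state_transition -e run_graph -e te_rsc_command -e te_fence_node -e tengine_stonith_notify /var/log/messages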
Node-level failures
Did the crmd fail to notice the failure?
If you do not see any entries from crm_update_peer_state(), check the corosync logs to see whether membership changes were reported correctly and in a timely fashion.
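As a sketch (the log location varies by distribution, and corosync may also be logging to syslog; /var/log/cluster/corosync.log is a common path), membership changes can be found with:
grep -e TOTEM -e 'A new membership' /var/log/cluster/corosync.log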
Did the crmd fail to initiate recovery?
If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.
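One way to get that extra detail is to enable debug logging for the crmd and restart Pacemaker; as a sketch, assuming a sysconfig-based installation (Debian-based systems use /etc/default/pacemaker), add to /etc/sysconfig/pacemaker:
PCMK_debug=crmd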
Did the crmd fail to perform recovery?
If you DO see entries from do_state_transition() but the run_graph() entries include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything. Obtain the file named by run_graph() (e.g. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.
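The decisions recorded in such a file can be replayed offline with crm_simulate; for example, using the file name from the example above:
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-473.bz2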
Was fencing attempted?
Check if the stonith-enabled property is set to true/1/yes; if so, obtain the file named by run_graph() (e.g. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.
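The property can be queried with crm_attribute:
crm_attribute --type crm_config --name stonith-enabled --query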
Did fencing complete?
If not, check the configuration of your fencing resources and proceed to Debugging Stonith.
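Depending on your Pacemaker version, stonith_admin can report past fencing operations for a node, which is a quick way to confirm whether fencing completed; using the host from the earlier examples:
stonith_admin --history corosync-host-8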
Resource-level failures
Did the resource actually fail?
If not, check for logs matching the resource name to see why the resource agent thought a failure occurred.
Check the resource agent source to see what code paths could have produced those logs (or the lack of them).
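A resource agent can also be exercised outside the cluster with ocf-tester from the resource-agents package. A sketch, assuming www is an ocf:heartbeat:apache resource (substitute your own agent and parameters):
ocf-tester -n www -o configfile=/etc/httpd/conf/httpd.conf /usr/lib/ocf/resource.d/heartbeat/apache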
Did the crmd notice the resource failure?
If not, check for logs matching the resource name to see if the resource agent noticed.
Check that a recurring monitor was configured.
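To verify that a recurring monitor is part of the resource definition, inspect the expanded configuration; a sketch, assuming the mysql resource from the earlier examples:
crm_resource --resource mysql --query-xml | grep monitor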
Did the crmd fail to initiate recovery?
If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.
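The result of each resource operation is logged by the crmd as it comes back; in 1.1-era versions these entries come from process_lrm_event(), so a failure the crmd has noticed should show up with:
grep process_lrm_event /var/log/messages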
Did the crmd fail to perform recovery?
If you DO see entries from do_state_transition() but the run_graph() entries include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything. Obtain the file named by run_graph() (e.g. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.
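A crm_report archive collects the pengine files (along with logs and the configuration) from every node; a sketch with hypothetical timestamps bracketing the failure:
crm_report -f "2013-05-10 00:00:00" -t "2013-05-10 01:00:00" /tmp/failure-report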
Did resources stop/start/move unexpectedly or fail to stop/start/move when expected?
Obtain the file named by run_graph() (e.g. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.
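To understand why resources ended up where they did, the allocation scores can also be dumped from the live cluster (or from a saved pe-input file via --xml-file):
crm_simulate --live-check --show-scores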