1
0
Fork 0
mirror of https://github.com/ansible-collections/community.general.git synced 2026-02-04 07:51:50 +00:00
community.general/plugins/modules/monit.py
Alexei Znamensky ac37544c53
monit: investigating tests again - using copilot on this one (#11255)
* add monit version to successful exit

* install the standard monit - if 5.34, then bail out

* add 3sec wait after service restart

- that restart happens exactly before the task receiving the SIGTERM, so maybe, just maybe, it just needs time to get ready for the party

* wait for monit initialisation after restart

* monit tests: check service-specific status in readiness wait

The wait task was checking 'monit status' (general), but the actual
failing command is 'monit status -B httpd_echo' (service-specific).
This causes a race where general status succeeds but service queries
fail. Update to check the exact command format that will be used.

* monit tests: remove 5.34.x version restriction

The version restriction was based on incorrect diagnosis. The actual
issue was the readiness check validating general status instead of
service-specific queries. Now that we check the correct command
format, the tests should work across all monit versions.

* monit tests: add stabilization delay after readiness check

After the readiness check succeeds, add a 1-second pause before
running actual tests. Monit 5.34.x and 5.35 appear to have a
concurrency issue where rapid successive 'monit status -B' calls
can cause hangs even though the first call succeeds.

* monit tests: add retry logic for state changes to handle monit daemon hangs

Monit daemon has an intermittent concurrency bug across versions 5.27-5.35
where 'monit status -B' commands can hang (receiving SIGTERM) even after
the daemon has successfully responded to previous queries. This appears
to be a monit daemon issue, not a timing problem.

Add retry logic with 2-second delays to the state change task to work
around these intermittent hangs. Skip retries if the failure is not
SIGTERM (rc=-15) to avoid masking real errors.

* monit tests: capture and display monit.log for debugging

Add tasks in the always block to capture and display the monit log file.
This will help diagnose the intermittent hanging issues by showing what
monit daemon was doing when 'monit status -B' commands hang.

* monit tests: enable verbose logging (-v flag)

Modify the monit systemd service to start with -v flag for verbose
logging. This should provide more detailed information in the monit
log about what's happening when status commands hang.

* monit: add 0.5s delay after state change command

After extensive testing and analysis with verbose logging enabled, identified
that monit's HTTP interface can become temporarily unresponsive immediately
after processing state change commands (stop, start, restart, etc.).

This manifests as intermittent SIGTERM (rc=-15) failures when the module
calls 'monit status -B <service>' to verify the state change. The issue
affects all monit versions tested (5.27-5.35) and is intermittent, suggesting
a race condition or brief lock in monit's HTTP request handling.

Verbose logging confirmed:
- State change commands complete successfully
- HTTP server reports as 'started'
- But subsequent status checks can hang without any log entry

Adding a 0.5 second sleep after sending state change commands gives the
monit daemon time to fully process the command and become responsive again
before the first status verification check.

This complements the existing readiness check after daemon restart and
the retry logic for SIGTERM failures in the tests.

* tests(monit): remove workarounds after module race condition fix

After 10+ successful CI runs with no SIGTERM failures, removing test-level
workarounds that are now redundant due to the 0.5s delay fix in the module:

- Remove 1-second stabilization pause after daemon restart
  The module's built-in 0.5s delay after state changes makes this unnecessary

- Remove retry logic for SIGTERM failures in state change tests
  The race condition is now prevented at the module level

- Remove verbose logging setup and log capture
  Verbose mode didn't log HTTP requests, so it didn't help diagnose the issue
  and adds unnecessary overhead

Kept the readiness check with retries after daemon restart - still needed
to validate daemon is responsive after service restart (different scenario
than the state change race condition).

* restore tasks/main.yml

* monit tests: reduce readiness check retries from 60 to 10

After successful CI runs, observed that monit daemon becomes responsive
within 1-2 seconds after restart. The readiness check typically passes
on the first attempt.

Reducing from 60 retries (30s timeout) to 10 retries (5s timeout) is
more appropriate and allows tests to fail faster if something is
genuinely broken.

* add changelog frag

* Update changelogs/fragments/11255-monit-integrationtests.yml

Co-authored-by: Felix Fontein <felix@fontein.de>

---------

Co-authored-by: Felix Fontein <felix@fontein.de>
2025-12-10 13:29:32 +01:00

375 lines
12 KiB
Python

#!/usr/bin/python
# Copyright (c) 2013, Darryl Stoflet <stoflet@gmail.com>
# GNU General Public License v3.0+ (see LICENSES/GPL-3.0-or-later.txt or https://www.gnu.org/licenses/gpl-3.0.txt)
# SPDX-License-Identifier: GPL-3.0-or-later
from __future__ import annotations
DOCUMENTATION = r"""
module: monit
short_description: Manage the state of a program monitored using Monit
description:
- Manage the state of a program monitored using Monit.
extends_documentation_fragment:
- community.general.attributes
attributes:
check_mode:
support: full
diff_mode:
support: none
options:
name:
description:
- The name of the C(monit) program/process to manage.
required: true
type: str
state:
description:
- The state of service.
required: true
choices: ["present", "started", "stopped", "restarted", "monitored", "unmonitored", "reloaded"]
type: str
timeout:
description:
- If there are pending actions for the service monitored by monit, then it checks for up to this many seconds to verify
the requested action has been performed. The module sleeps for five seconds between each check.
default: 300
type: int
author:
- Darryl Stoflet (@dstoflet)
- Simon Kelly (@snopoke)
"""
EXAMPLES = r"""
- name: Manage the state of program httpd to be in started state
community.general.monit:
name: httpd
state: started
"""
import time
import re
from enum import Enum
from ansible.module_utils.basic import AnsibleModule
STATE_COMMAND_MAP = {
"stopped": "stop",
"started": "start",
"monitored": "monitor",
"unmonitored": "unmonitor",
"restarted": "restart",
}
MONIT_SERVICES = ["Process", "File", "Fifo", "Filesystem", "Directory", "Remote host", "System", "Program", "Network"]
class StatusValue(Enum):
MISSING = "missing"
OK = "ok"
NOT_MONITORED = "not_monitored"
INITIALIZING = "initializing"
DOES_NOT_EXIST = "does_not_exist"
EXECUTION_FAILED = "execution_failed"
class Status:
"""Represents a monit status with optional pending state."""
def __init__(self, status_val: str | StatusValue, is_pending: bool = False):
if isinstance(status_val, StatusValue):
self.state = status_val
else:
self.state = getattr(StatusValue, status_val)
self.is_pending = is_pending
@property
def value(self):
return self.state.value
def pending(self):
"""Return a new Status instance with is_pending=True."""
return Status(self.state, is_pending=True)
def __getattr__(self, item):
if item.startswith("is_"):
status_name = item[3:].upper()
if hasattr(StatusValue, status_name):
return self.value == getattr(StatusValue, status_name).value
raise AttributeError(item)
def __eq__(self, other):
if not isinstance(other, Status):
return False
return self.state == other.state and self.is_pending == other.is_pending
def __str__(self):
return f"{self.value}{' (pending)' if self.is_pending else ''}"
def __repr__(self):
return f"<{self}>"
# Initialize convenience class attributes
class Monit:
def __init__(self, module, monit_bin_path, service_name, timeout):
self.module = module
self.monit_bin_path = monit_bin_path
self.process_name = service_name
self.timeout = timeout
self._monit_version = None
self._raw_version = None
self._status_change_retry_count = 6
def monit_version(self):
if self._monit_version is None:
self._raw_version, version = self._get_monit_version()
# Use only major and minor even if there are more these should be enough
self._monit_version = version[0], version[1]
return self._monit_version
def _get_monit_version(self):
rc, out, err = self.module.run_command([self.monit_bin_path, "-V"], check_rc=True)
version_line = out.split("\n")[0]
raw_version = re.search(r"([0-9]+\.){1,2}([0-9]+)?", version_line).group()
return raw_version, tuple(map(int, raw_version.split(".")))
def exit_fail(self, msg, status=None, **kwargs):
kwargs.update(
{
"msg": msg,
"monit_version": self._raw_version,
"process_status": str(status) if status else None,
}
)
self.module.fail_json(**kwargs)
def exit_success(self, state):
self.module.exit_json(
changed=True,
name=self.process_name,
monit_version=self._raw_version,
state=state,
)
@property
def command_args(self):
return ["-B"] if self.monit_version() > (5, 18) else []
def get_status(self, validate=False) -> Status:
"""Return the status of the process in monit.
:@param validate: Force monit to re-check the status of the process
"""
monit_command = "validate" if validate else "status"
check_rc = not validate # 'validate' always has rc = 1
command = [self.monit_bin_path, monit_command] + self.command_args + [self.process_name]
rc, out, err = self.module.run_command(command, check_rc=check_rc)
return self._parse_status(out, err)
def _parse_status(self, output, err) -> Status:
escaped_monit_services = "|".join([re.escape(x) for x in MONIT_SERVICES])
pattern = f"({escaped_monit_services}) '{re.escape(self.process_name)}'"
if not re.search(pattern, output, re.IGNORECASE):
return Status("MISSING")
status_val_find = re.findall(r"^\s*status\s*([\w\- ]+)", output, re.MULTILINE)
if not status_val_find:
self.exit_fail("Unable to find process status", stdout=output, stderr=err)
status_val = status_val_find[0].strip().upper()
if " | " in status_val:
status_val = status_val.split(" | ")[0]
if " - " not in status_val:
status_val = status_val.replace(" ", "_")
# Normalize RUNNING to OK (monit reports both, they mean the same thing)
if status_val == "RUNNING":
status_val = "OK"
try:
return Status(status_val)
except AttributeError:
self.module.warn(f"Unknown monit status '{status_val}', treating as execution failed")
return Status("EXECUTION_FAILED")
else:
status_val, substatus = status_val.split(" - ")
action, state = substatus.split()
if action in ["START", "INITIALIZING", "RESTART", "MONITOR"]:
status = Status("OK")
else:
status = Status("NOT_MONITORED")
if state == "PENDING":
status = status.pending()
return status
def is_process_present(self):
command = [self.monit_bin_path, "summary"] + self.command_args
rc, out, err = self.module.run_command(command, check_rc=True)
return bool(re.findall(rf"\b{self.process_name}\b", out))
def is_process_running(self):
return self.get_status().is_ok
def run_command(self, command):
"""Runs a monit command, and returns the new status."""
return self.module.run_command([self.monit_bin_path, command, self.process_name], check_rc=True)
def wait_for_status_change(self, current_status):
running_status = self.get_status()
if running_status.value != current_status.value or current_status.value == StatusValue.EXECUTION_FAILED:
return running_status
loop_count = 0
while running_status.value == current_status.value:
if loop_count >= self._status_change_retry_count:
self.exit_fail("waited too long for monit to change state", running_status)
loop_count += 1
time.sleep(0.5)
validate = loop_count % 2 == 0 # force recheck of status every second try
running_status = self.get_status(validate)
return running_status
def wait_for_monit_to_stop_pending(self, current_status=None):
"""Fails this run if there is no status or it is pending/initializing for timeout"""
timeout_time = time.time() + self.timeout
if not current_status:
current_status = self.get_status()
waiting_status = [
StatusValue.MISSING,
StatusValue.INITIALIZING,
StatusValue.DOES_NOT_EXIST,
]
while current_status.is_pending or (current_status.state in waiting_status):
if time.time() >= timeout_time:
self.exit_fail('waited too long for "pending", or "initiating" status to go away', current_status)
time.sleep(5)
current_status = self.get_status(validate=True)
return current_status
def reload(self):
rc, out, err = self.module.run_command([self.monit_bin_path, "reload"])
if rc != 0:
self.exit_fail("monit reload failed", stdout=out, stderr=err)
self.exit_success(state="reloaded")
def present(self):
self.run_command("reload")
timeout_time = time.time() + self.timeout
while not self.is_process_present():
if time.time() >= timeout_time:
self.exit_fail('waited too long for process to become "present"')
time.sleep(5)
self.exit_success(state="present")
def change_state(self, state: str, expected_status: StatusValue, invert_expected: bool | None = None):
current_status = self.get_status()
self.run_command(STATE_COMMAND_MAP[state])
# Give monit daemon a moment to process the command before checking status
# to avoid race condition where HTTP interface may be temporarily unresponsive
time.sleep(0.5)
status = self.wait_for_status_change(current_status)
status = self.wait_for_monit_to_stop_pending(status)
status_match = status.state == expected_status
if invert_expected:
status_match = not status_match
if status_match:
self.exit_success(state=state)
self.exit_fail(f"{self.process_name} process not {state}", status)
def stop(self):
self.change_state("stopped", expected_status=StatusValue.NOT_MONITORED)
def unmonitor(self):
self.change_state("unmonitored", expected_status=StatusValue.NOT_MONITORED)
def restart(self):
self.change_state("restarted", expected_status=StatusValue.OK)
def start(self):
self.change_state("started", expected_status=StatusValue.OK)
def monitor(self):
self.change_state("monitored", expected_status=StatusValue.NOT_MONITORED, invert_expected=True)
def main():
arg_spec = dict(
name=dict(required=True),
timeout=dict(default=300, type="int"),
state=dict(
required=True,
choices=["present", "started", "restarted", "stopped", "monitored", "unmonitored", "reloaded"],
),
)
module = AnsibleModule(argument_spec=arg_spec, supports_check_mode=True)
name = module.params["name"]
state = module.params["state"]
timeout = module.params["timeout"]
monit = Monit(module, module.get_bin_path("monit", True), name, timeout)
def exit_if_check_mode():
if module.check_mode:
module.exit_json(changed=True)
if state == "reloaded":
exit_if_check_mode()
monit.reload()
present = monit.is_process_present()
if not present and state != "present":
module.fail_json(msg=f"{name} process not presently configured with monit", name=name)
if state == "present":
if present:
module.exit_json(changed=False, name=name, state=state)
exit_if_check_mode()
monit.present()
monit.wait_for_monit_to_stop_pending()
running = monit.is_process_running()
if running and state in ["started", "monitored"]:
module.exit_json(changed=False, name=name, state=state)
if running and state == "stopped":
exit_if_check_mode()
monit.stop()
if running and state == "unmonitored":
exit_if_check_mode()
monit.unmonitor()
elif state == "restarted":
exit_if_check_mode()
monit.restart()
elif not running and state == "started":
exit_if_check_mode()
monit.start()
elif not running and state == "monitored":
exit_if_check_mode()
monit.monitor()
module.exit_json(changed=False, name=name, state=state)
if __name__ == "__main__":
main()