dm_mac.models.machine module

Model representing a machine.

dm_mac.models.machine.FLEET_TIMEOUT_COOLDOWN_SEC: float = 300.0

Minimum spacing, in seconds, between consecutive fleet-wide Slack notifications, to avoid spamming the channel during a sustained disk hang.

dm_mac.models.machine.FLEET_TIMEOUT_THRESHOLD: int = 2

Minimum number of distinct machines that must hit a state-save timeout within FLEET_TIMEOUT_WINDOW_SEC to trigger the fleet-wide Slack notification.

dm_mac.models.machine.FLEET_TIMEOUT_WINDOW_SEC: float = 60.0

Window, in seconds, over which to count distinct machines that hit state-save timeouts for the fleet-wide Slack alert (see FleetTimeoutTracker).

class dm_mac.models.machine.FleetTimeoutTracker(window_sec: float = 60.0, threshold: int = 2, cooldown_sec: float = 300.0)

Bases: object

Cross-machine accounting for state-save timeouts.

The per-machine Slack notification in MachineState._notify_save_timeout() fires only on the transition to 2 lifetime timeouts for a single machine, which is the right signal for “this machine is repeatedly slow”. It is the wrong signal for “the disk on the mac-server host just hung”, which produces the 2026-05-11 pattern: N distinct machines each hit their first lifetime timeout simultaneously, every per-machine counter goes 0 → 1, and no Slack message fires.

This tracker fills that gap. Each timeout records (machine_name, monotonic_ts); when at least FLEET_TIMEOUT_THRESHOLD distinct machines have recorded a timeout within FLEET_TIMEOUT_WINDOW_SEC, the tracker signals that a fleet-wide notification should fire, subject to FLEET_TIMEOUT_COOLDOWN_SEC between consecutive notifications.

See docs/2026-05-11-mcu-lockup-analysis.md for the motivating incident.
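
The windowing, threshold, and cooldown rules described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name FleetTimeoutTrackerSketch is hypothetical, and the exact boundary handling (strictly older than the window is dropped, cooldown compared with <) is an assumption.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class FleetTimeoutTrackerSketch:
    """Illustrative sliding-window tracker; hypothetical, not the real class."""

    def __init__(self, window_sec: float = 60.0, threshold: int = 2,
                 cooldown_sec: float = 300.0) -> None:
        self.window_sec = window_sec
        self.threshold = threshold
        self.cooldown_sec = cooldown_sec
        self._events: Deque[Tuple[str, float]] = deque()
        self._last_notification_ts: Optional[float] = None

    def record(self, machine_name: str,
               now: Optional[float] = None) -> Optional[int]:
        if now is None:
            now = time.monotonic()
        self._events.append((machine_name, now))
        # Drop events that have aged out of the window (assumed boundary rule).
        while self._events and now - self._events[0][1] > self.window_sec:
            self._events.popleft()
        # Count distinct machines, not raw events.
        distinct = len({name for name, _ in self._events})
        if distinct < self.threshold:
            return None
        # Suppress the alert while the cooldown is still running.
        if (self._last_notification_ts is not None
                and now - self._last_notification_ts < self.cooldown_sec):
            return None
        self._last_notification_ts = now  # a non-None return arms the cooldown
        return distinct


tracker = FleetTimeoutTrackerSketch()
first = tracker.record("lathe", now=0.0)     # one machine: below threshold
second = tracker.record("mill", now=5.0)     # second distinct machine: fires
third = tracker.record("saw", now=10.0)      # threshold met, but in cooldown
repeat = tracker.record("lathe", now=400.0)  # old events expired: one machine
```

Passing now explicitly, as the tests are described as doing, keeps the example deterministic; production callers would omit it and rely on the monotonic clock.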

_events: Deque[Tuple[str, float]]
_last_notification_ts: float | None
cooldown_sec: float
record(machine_name: str, now: float | None = None) → int | None

Record a state-save timeout for machine_name.

Parameters:
  • machine_name – Internal machine name (not display name).

  • now – Override the current monotonic timestamp; used by tests. Production callers should omit this.

Returns:

None if no fleet-wide notification should fire; otherwise the count of distinct machines within the window at the moment the threshold was crossed. A non-None return implicitly arms the cooldown.

threshold: int
window_sec: float
class dm_mac.models.machine.Machine(name: str, authorizations_or: List[str], unauthorized_warn_only: bool = False, always_enabled: bool = False, alias: str | None = None, second_relay: SecondRelayConfig | None = None)

Bases: object

Object representing a machine and its state and configuration.

alias: str | None

Optional human-friendly alias for the machine

always_enabled: bool

Whether machine is always enabled without RFID authentication

property as_dict: Dict[str, Any]

Return a dict representation of this machine.

authorizations_or: List[str]

List of OR’ed authorizations, any of which is sufficient

property display_name: str

Return the display name for this machine (alias if present, else name).

async lockout(slack: SlackHandler | None = None) → None

Pass directly to self.state.

name: str

The name of the machine

async oops(slack: SlackHandler | None = None) → None

Pass directly to self.state.

second_relay: SecondRelayConfig | None

Optional second-relay configuration

state: MachineState

State of the machine

unauthorized_warn_only: bool

Whether to allow anyone to operate machine regardless of authorization, just logging/displaying a warning if unauthorized

async unlock(slack: SlackHandler | None = None) → None

Pass directly to self.state.

async unoops(slack: SlackHandler | None = None) → None

Pass directly to self.state.

async update(users: UsersConfig, **kwargs: Any) → Dict[str, str | bool | float | List[float]]

Pass directly to self.state and return result.

class dm_mac.models.machine.MachineState(machine: Machine, load_state: bool = True)

Bases: object

Object representing the state of a machine at a point in time.

ALWAYS_ON_DISPLAY_TEXT: str = 'Always On'
DEFAULT_DISPLAY_TEXT: str = 'Please Insert\nRFID Card'
LOCKOUT_DISPLAY_TEXT: str = 'Down for\nmaintenance'
OOPS_DISPLAY_TEXT: str = 'Oops!! Please\ncheck/post Slack'
STATUS_LED_BRIGHTNESS: float = 0.5
async _handle_oops(users: UsersConfig) → None

Handle oops button press.

async _handle_reboot() → None

Handle when the ESP32 (MCU) has rebooted since last checkin.

This logs out the current user if logged in and resets the machine state. For always-enabled machines, restores the always-on state.

async _handle_rfid_insert(users: UsersConfig, rfid_value: str) → None

Handle change in the RFID value.

async _handle_rfid_remove() → None

Handle RFID card removed.

async _handle_rfid_tracking_always_enabled(users: UsersConfig, rfid_value: str | None) → None

Track RFID changes for always-enabled machines without changing state.

This method logs RFID insertions and removals for auditing purposes while maintaining the always-on state of the machine.

_load_from_cache() → None

Load machine state cache from disk.

_lock: lock
_log_second_relay_decision() → None

Emit a structured AUTH log line for the current second-relay decision.

_notify_fleet_save_timeout() → None

Fire a Slack alert if multiple machines hit timeouts in a short window.

Complements _notify_save_timeout(): that rule pages on the second lifetime timeout for one machine (“this machine is slow”); this rule pages when FLEET_TIMEOUT_THRESHOLD distinct machines hit any timeout within FLEET_TIMEOUT_WINDOW_SEC (“the disk is slow”). Cooldown via the tracker prevents re-paging during a sustained hang.

_notify_save_timeout(count: int) → None

Send a fire-and-forget Slack notification on the second save timeout.

Skipped on the first timeout to tolerate single transient stalls; fired exactly once, on the transition to 2, to avoid spamming SLACK_CONTROL_CHANNEL_ID under a sustained disk hang (where timeouts can arrive every ~10 s as MCU heartbeats keep coming). Operators who need to catch a sustained increase can alert on the mac_state_save_timeouts_total Prometheus counter instead.
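
The "exactly once on the transition to 2" rule amounts to an equality check on the post-increment count. A minimal sketch, assuming a hypothetical helper named should_notify_on_save_timeout (the real method presumably also sends the Slack message):

```python
def should_notify_on_save_timeout(count: int) -> bool:
    """Return True only for the second lifetime timeout (hypothetical helper)."""
    # count == 1: tolerated as a transient stall.
    # count >= 3: already notified; stay quiet during a sustained hang.
    return count == 2


# Simulate five consecutive timeouts and collect the counts that would notify.
fired = [c for c in range(1, 6) if should_notify_on_save_timeout(c)]
```

Using == rather than >= is what keeps a sustained hang from producing one message per heartbeat.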

_on_save_task_done(task: Task[None]) → None

Done-callback for the in-flight save task.

Logs (and thus consumes) any exception the underlying _save_cache() raised, so a thread that finishes after we have already timed out cannot leak unhandled exceptions into the event loop. Also clears _save_task if this is still the current task, so a subsequent successful save can run.

_record_save_timeout(reason: str) → int

Increment the timeout counter, log, and notify Slack.

Returns the post-increment lifetime count so callers can include it in the raised exception.

_resolve_second_relay(emit_log: bool = True) → None

Compute desired second-relay state and authorization decision.

Called after every primary-state mutation. Sets second_relay_desired_state and second_relay_authorization per the decision tree in data-model.md. Fails closed on unexpected errors (False / “denied”). Emits a structured AUTH log line for each decision unless emit_log is False (used to avoid double logging when callers will log later).

_save_cache() → None

Save machine state cache to disk (synchronous).

Acquires the in-process lock and on-disk filelock, builds the state dict, and writes the pickle. Used directly by maintenance tools and tests; request handlers should call save_cache() instead so the write is bounded by STATE_SAVE_TIMEOUT_SEC.

_save_spawn_lock: Lock | None

Guards the check-and-set of _save_task so two concurrent callers cannot both observe _save_task as None/done() and spawn separate workers. Lazily created on first use so we don’t bind to a specific event loop at construction time.

_save_task: Task[None] | None

Tracks the in-flight asyncio.to_thread task spawned by save_cache(). While this task is running (or hung on a stuck disk) subsequent calls to save_cache() join the existing task instead of spawning more threads, so a single hung disk write cannot exhaust the default thread pool. Each joiner gets its own STATE_SAVE_TIMEOUT_SEC budget, so brief overlap finishes successfully while a sustained hang produces independent timeout events on each subsequent request (which is what drives the mac_state_save_timeouts_total counter and the Slack-on-second-timeout rule).

_state_dir: str

Path to the directory to save machine state in

_state_path: str

Path to pickled state file

async _user_is_authorized(user: User, slack: SlackHandler | None = None) → bool

Return whether user is authorized for this machine.

_user_is_second_authorized(user: User) → bool

Return whether user holds any of the second-relay authorizations.

current_amps: float

Last reported output ammeter reading (if equipped).

current_user: User | None

Current user logged in to the machine

display_text: str

Text currently displayed on the machine LCD screen

internal_temperature_c: float | None

ESP32 internal temperature in °C

is_locked_out: bool

Whether the machine is locked out from use.

is_oopsed: bool

Whether the machine’s Oops button has been pressed.

is_override_login: bool

Whether the machine is in an override login state

last_checkin: float | None

Float timestamp of the machine’s last checkin time

last_update: float | None

Float timestamp of the last time that machine state changed in a meaningful way, i.e. a change in RFID value or Oops state

lockout() → None

Lock out the machine.

machine: Machine

The Machine that this state is for

property machine_response: Dict[str, str | bool | float | List[float]]

Return the response dict to send to the machine.

oops(do_locking: bool = True) → None

Oops the machine.

relay_desired_state: bool

Whether the output relay should be on or not.

rfid_present_since: float | None

Float timestamp when rfid_value last changed to a non-None value.

rfid_value: str | None

Value of the RFID card/fob in use, or None if not present.

async save_cache() → None

Save machine state cache to disk with a timeout.

Single-flight per machine: only one save thread is outstanding at a time. Concurrent callers see the existing in-flight task and join it (awaiting the same task) rather than spawning a second thread that would also block on the same disk lock; this prevents thread-pool exhaustion under a sustained disk hang while heartbeats keep arriving.

Whether the caller spawned the task or joined an existing one, it then awaits with its own STATE_SAVE_TIMEOUT_SEC budget. Brief overlap (the existing save finishes within the joiner’s budget) returns success without counting a timeout. A sustained hang produces an independent timeout event on each request that exceeds its budget; the second such event triggers the Slack notification.

On timeout, the underlying thread is shielded and continues running (Python cannot cancel a thread blocked on file I/O); state_save_timeouts is incremented and StateSaveTimeoutError is raised.
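
The single-flight, shielded-timeout behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: SingleFlightSaver and its write_fn argument are hypothetical, the local StateSaveTimeoutError is a stand-in for the documented exception, and the done-callback that logs late exceptions is omitted for brevity.

```python
import asyncio
import time
from typing import Callable, Optional


class StateSaveTimeoutError(Exception):
    """Stand-in for the exception documented in this module."""


class SingleFlightSaver:
    """Hypothetical helper illustrating the single-flight save pattern."""

    def __init__(self, write_fn: Callable[[], None], timeout_sec: float) -> None:
        self._write_fn = write_fn  # blocking callable that performs the write
        self._timeout_sec = timeout_sec
        self._task: Optional[asyncio.Task] = None
        self._spawn_lock: Optional[asyncio.Lock] = None  # created lazily
        self.timeouts = 0

    async def save(self) -> None:
        if self._spawn_lock is None:
            self._spawn_lock = asyncio.Lock()
        async with self._spawn_lock:
            # Spawn a worker only if none is in flight; otherwise join it.
            if self._task is None or self._task.done():
                self._task = asyncio.create_task(
                    asyncio.to_thread(self._write_fn))
            task = self._task
        try:
            # shield() keeps the worker running if this waiter times out.
            await asyncio.wait_for(asyncio.shield(task), self._timeout_sec)
        except asyncio.TimeoutError:
            self.timeouts += 1
            raise StateSaveTimeoutError(
                f"state save timed out ({self.timeouts} total)")


async def demo():
    calls = []

    def slow_write() -> None:
        calls.append(1)
        time.sleep(0.3)  # simulate a slow disk

    saver = SingleFlightSaver(slow_write, timeout_sec=0.05)
    # Two concurrent callers share one worker thread; each has its own budget.
    results = await asyncio.gather(saver.save(), saver.save(),
                                   return_exceptions=True)
    await saver._task  # let the shielded write finish before exiting
    return len(calls), results, saver.timeouts


writes, results, timeouts = asyncio.run(demo())
```

In this run both callers time out independently while only one write is ever started, which is the property that protects the thread pool during a hang.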

second_relay_authorization: str | None

Authorization decision outcome for the second relay (granted/denied/warn/always_enabled), or None if no second relay.

second_relay_desired_state: bool

Whether the server wants the second relay energized.

state_save_timeouts: int

Lifetime count of state-save timeouts for this machine. Persisted with the rest of the machine state (best-effort: a write that itself times out cannot persist the increment until the next successful save); surfaced as the mac_state_save_timeouts_total Prometheus counter from the in-memory value, which is always increment-correct because save_cache() is single-flight per machine.

status_led_brightness: float

Status LED brightness value; float 0 to 1

status_led_rgb: Tuple[float, float, float]

RGB values for status LED; floats 0 to 1

unlock() → None

Remove the lockout from the machine.

unoops(do_locking: bool = True) → None

Un-oops the machine.

async update(users: UsersConfig, oops: bool = False, rfid_value: str | None = None, uptime: float | None = None, wifi_signal_db: float | None = None, wifi_signal_percent: float | None = None, internal_temperature_c: float | None = None, amps: float | None = None, second_relay_state: bool | None = None) → Dict[str, str | bool | float | List[float]]

Handle an update to the machine via API.

uptime: float

Uptime of the machine’s ESP32 in seconds

wifi_signal_db: float | None

ESP32 WiFi signal strength in dB

wifi_signal_percent: float | None

ESP32 WiFi signal strength in percent

class dm_mac.models.machine.MachinesConfig

Bases: object

Class representing machines configuration file.

_load_and_validate_config() → Dict[str, Dict[str, Any]]

Load and validate the config file.

get_machine(name_or_alias: str) → Machine | None

Get a machine by name or alias.
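
A lookup of this shape can be built on the machines_by_name and machines_by_alias mappings. The sketch below is illustrative only: MachineStub and get_machine_sketch are hypothetical names, and checking the name before the alias is an assumption about precedence.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MachineStub:
    """Hypothetical stand-in for Machine with only the fields needed here."""
    name: str
    alias: Optional[str] = None


def get_machine_sketch(name_or_alias: str,
                       by_name: Dict[str, MachineStub],
                       by_alias: Dict[str, MachineStub]) -> Optional[MachineStub]:
    # Assumed precedence: an exact name match wins over an alias match.
    machine = by_name.get(name_or_alias)
    if machine is not None:
        return machine
    return by_alias.get(name_or_alias)


lathe = MachineStub(name="metal-lathe", alias="Lathe")
by_name = {lathe.name: lathe}
by_alias = {lathe.alias: lathe}

found_by_name = get_machine_sketch("metal-lathe", by_name, by_alias)
found_by_alias = get_machine_sketch("Lathe", by_name, by_alias)
missing = get_machine_sketch("bandsaw", by_name, by_alias)  # unknown: None
```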

load_time: float
machines: List[Machine]
machines_by_alias: Dict[str, Machine]
machines_by_name: Dict[str, Machine]
static validate_config(config: Dict[str, Dict[str, Any]]) → None

Validate configuration via jsonschema.

dm_mac.models.machine.STATE_SAVE_TIMEOUT_SEC: float = 2.0

Maximum wall-clock seconds we will spend persisting machine state to disk before raising StateSaveTimeoutError. Keeps a single hung disk write from blocking the request handler long enough to wedge the firmware (see docs/2026-05-05-mcu-lockup-analysis.md).

class dm_mac.models.machine.SecondRelayConfig(authorizations_or: List[str], unauthorized_warn_only: bool = False, always_enabled: bool = False, alias: str | None = None)

Bases: object

Authorization rules governing a machine’s second relay.

alias: str | None
always_enabled: bool
property as_dict: Dict[str, Any]

Return a dict representation of this second relay config.

authorizations_or: List[str]
unauthorized_warn_only: bool
exception dm_mac.models.machine.StateSaveTimeoutError

Bases: Exception

Raised when persisting machine state to disk exceeds the budget.

Surfaced to MCU clients as HTTP 503 by the /api/machine/update view (and by /api/machine/oops/<name> and /api/machine/locked_out/<name>) so the firmware sees a clean error and recovers on its next heartbeat.