dm_mac.models.machine module¶
Model representing a machine.
- dm_mac.models.machine.FLEET_TIMEOUT_COOLDOWN_SEC: float = 300.0¶
Minimum spacing between consecutive fleet-wide Slack notifications, to avoid spamming the channel during a sustained disk hang.
- dm_mac.models.machine.FLEET_TIMEOUT_THRESHOLD: int = 2¶
Minimum distinct machines within FLEET_TIMEOUT_WINDOW_SEC that triggers the fleet-wide Slack notification.
- dm_mac.models.machine.FLEET_TIMEOUT_WINDOW_SEC: float = 60.0¶
Window over which to count distinct machines that hit state-save timeouts for the fleet-wide Slack alert (see FleetTimeoutTracker).
- class dm_mac.models.machine.FleetTimeoutTracker(window_sec: float = 60.0, threshold: int = 2, cooldown_sec: float = 300.0)¶
Bases: object
Cross-machine accounting for state-save timeouts.
The per-machine Slack notification in MachineState._notify_save_timeout() only fires on the transition to 2 lifetime timeouts for a single machine, which is the right signal for “this machine is repeatedly slow”. It is the wrong signal for “the disk on the mac-server host just hung”, which produces the 2026-05-11 pattern: N distinct machines each hit their first lifetime timeout simultaneously, every per-machine counter goes 0 → 1, and no Slack message fires.
This tracker fills that gap. Each timeout records (machine_name, monotonic_ts); when at least FLEET_TIMEOUT_THRESHOLD distinct machines have recorded a timeout within FLEET_TIMEOUT_WINDOW_SEC, the tracker signals that a fleet-wide notification should fire, subject to FLEET_TIMEOUT_COOLDOWN_SEC between consecutive notifications.
See docs/2026-05-11-mcu-lockup-analysis.md for the motivating incident.
- _events: Deque[Tuple[str, float]]¶
- _last_notification_ts: float | None¶
- cooldown_sec: float¶
- record(machine_name: str, now: float | None = None) int | None¶
Record a state-save timeout for machine_name.
- Parameters:
machine_name – Internal machine name (not display name).
now – Override the current monotonic timestamp; used by tests. Production callers should omit this.
- Returns:
None if no fleet-wide notification should fire; otherwise the count of distinct machines within the window at the moment the threshold was crossed. A non-None return implicitly arms the cooldown.
- threshold: int¶
- window_sec: float¶
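The windowing, threshold, and cooldown behaviour described for FleetTimeoutTracker can be sketched as below. This is a minimal illustration built only from the documented attributes and the record() contract, not the dm_mac implementation: the class name FleetTimeoutTrackerSketch is hypothetical, and the exact boundary handling (an event exactly window_sec old is kept, a notification exactly cooldown_sec after the last one is allowed) is an assumption.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class FleetTimeoutTrackerSketch:
    """Illustrative sliding-window tracker (hypothetical, not the real class)."""

    def __init__(
        self,
        window_sec: float = 60.0,
        threshold: int = 2,
        cooldown_sec: float = 300.0,
    ) -> None:
        self.window_sec = window_sec
        self.threshold = threshold
        self.cooldown_sec = cooldown_sec
        self._events: Deque[Tuple[str, float]] = deque()
        self._last_notification_ts: Optional[float] = None

    def record(self, machine_name: str, now: Optional[float] = None) -> Optional[int]:
        # Production callers omit ``now``; tests pass a fake monotonic timestamp.
        if now is None:
            now = time.monotonic()
        self._events.append((machine_name, now))
        # Drop events that have aged out of the window (boundary is an assumption).
        while self._events and now - self._events[0][1] > self.window_sec:
            self._events.popleft()
        distinct = len({name for name, _ in self._events})
        if distinct < self.threshold:
            return None
        # Suppress repeat notifications while the cooldown is active.
        if (
            self._last_notification_ts is not None
            and now - self._last_notification_ts < self.cooldown_sec
        ):
            return None
        self._last_notification_ts = now  # arm the cooldown
        return distinct
```

With the default settings, two timeouts from the same machine never trigger a fleet-wide notification, while a timeout from a second machine inside the window does; further timeouts during the cooldown return None.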
- class dm_mac.models.machine.Machine(name: str, authorizations_or: List[str], unauthorized_warn_only: bool = False, always_enabled: bool = False, alias: str | None = None, second_relay: SecondRelayConfig | None = None)¶
Bases: object
Object representing a machine and its state and configuration.
- alias: str | None¶
Optional human-friendly alias for the machine
- always_enabled: bool¶
Whether machine is always enabled without RFID authentication
- property as_dict: Dict[str, Any]¶
Return a dict representation of this machine.
- authorizations_or: List[str]¶
List of OR’ed authorizations, any of which is sufficient
- property display_name: str¶
Return the display name for this machine (alias if present, else name).
- async lockout(slack: SlackHandler | None = None) None¶
Pass directly to self.state.
- name: str¶
The name of the machine
- async oops(slack: SlackHandler | None = None) None¶
Pass directly to self.state.
- second_relay: SecondRelayConfig | None¶
Optional second-relay configuration
- state: MachineState¶
State of the machine
- unauthorized_warn_only: bool¶
Whether to allow anyone to operate machine regardless of authorization, just logging/displaying a warning if unauthorized
- async unlock(slack: SlackHandler | None = None) None¶
Pass directly to self.state.
- async unoops(slack: SlackHandler | None = None) None¶
Pass directly to self.state.
- async update(users: UsersConfig, **kwargs: Any) Dict[str, str | bool | float | List[float]]¶
Pass directly to self.state and return result.
- class dm_mac.models.machine.MachineState(machine: Machine, load_state: bool = True)¶
Bases: object
Object representing frozen state in time of a machine.
- ALWAYS_ON_DISPLAY_TEXT: str = 'Always On'¶
- DEFAULT_DISPLAY_TEXT: str = 'Please Insert\nRFID Card'¶
- LOCKOUT_DISPLAY_TEXT: str = 'Down for\nmaintenance'¶
- OOPS_DISPLAY_TEXT: str = 'Oops!! Please\ncheck/post Slack'¶
- STATUS_LED_BRIGHTNESS: float = 0.5¶
- async _handle_oops(users: UsersConfig) None¶
Handle oops button press.
- async _handle_reboot() None¶
Handle when the ESP32 (MCU) has rebooted since last checkin.
This logs out the current user if logged in and resets the machine state. For always-enabled machines, restores the always-on state.
- async _handle_rfid_insert(users: UsersConfig, rfid_value: str) None¶
Handle change in the RFID value.
- async _handle_rfid_remove() None¶
Handle RFID card removed.
- async _handle_rfid_tracking_always_enabled(users: UsersConfig, rfid_value: str | None) None¶
Track RFID changes for always-enabled machines without changing state.
This method logs RFID insertions and removals for auditing purposes while maintaining the always-on state of the machine.
- _load_from_cache() None¶
Load machine state cache from disk.
- _lock: lock¶
- _log_second_relay_decision() None¶
Emit a structured AUTH log line for the current second-relay decision.
- _notify_fleet_save_timeout() None¶
Fire a Slack alert if multiple machines hit timeouts in a short window.
Complements _notify_save_timeout(): that rule pages on the second lifetime timeout for one machine (“this machine is slow”); this rule pages when FLEET_TIMEOUT_THRESHOLD distinct machines hit any timeout within FLEET_TIMEOUT_WINDOW_SEC (“the disk is slow”). Cooldown via the tracker prevents re-paging during a sustained hang.
- _notify_save_timeout(count: int) None¶
Fire a fire-and-forget Slack notification on the 2nd save timeout.
Skipped on the first timeout to tolerate single transient stalls; fired exactly once on the transition to 2 to avoid spamming SLACK_CONTROL_CHANNEL_ID under a sustained disk hang (where timeouts can arrive every ~10 s as MCU heartbeats keep coming). Operators monitoring the mac_state_save_timeouts_total Prometheus counter can alert on sustained increase from there.
- _on_save_task_done(task: Task[None]) None¶
Done-callback for the in-flight save task.
Logs (and thus consumes) any exception the underlying _save_cache() raised, so a thread that finishes after we have already timed out cannot leak unhandled exceptions into the event loop. Also clears _save_task if this is still the current task, so a subsequent successful save can run.
- _record_save_timeout(reason: str) int¶
Increment the timeout counter, log, and notify Slack.
Returns the post-increment lifetime count so callers can include it in the raised exception.
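The increment, log, and notify-once-on-the-second-timeout flow described for _record_save_timeout() and _notify_save_timeout() might look roughly like the sketch below. The class SaveTimeoutCounterSketch, its notify callback, and the message wording are hypothetical stand-ins; the real code talks to Slack and is not shown in this reference.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class SaveTimeoutCounterSketch:
    """Illustrative per-machine timeout counter (hypothetical helper)."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.state_save_timeouts = 0
        self._notify = notify  # stand-in for the Slack notification call

    def record_save_timeout(self, reason: str) -> int:
        self.state_save_timeouts += 1
        count = self.state_save_timeouts
        logger.warning("state save timeout #%d: %s", count, reason)
        # Notify only on the transition to 2: the first timeout is tolerated,
        # and later ones stay quiet to avoid spamming during a sustained hang.
        if count == 2:
            self._notify(f"state save timed out {count} times ({reason})")
        return count


sent: List[str] = []
counter = SaveTimeoutCounterSketch(sent.append)
counts = [counter.record_save_timeout("disk hang") for _ in range(5)]
```

Five consecutive timeouts yield five increasing counts but a single notification, which matches the "exactly once" rule described above.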
- _resolve_second_relay(emit_log: bool = True) None¶
Compute desired second-relay state and authorization decision.
Called after every primary-state mutation. Sets second_relay_desired_state and second_relay_authorization per the decision tree in data-model.md. Fails closed on unexpected errors (False / “denied”). Emits a structured AUTH log line for each decision unless emit_log is False (used to avoid double logging when callers will log later).
- _save_cache() None¶
Save machine state cache to disk (synchronous).
Acquires the in-process lock and on-disk filelock, builds the state dict, and writes the pickle. Used directly by maintenance tools and tests; request handlers should call save_cache() instead so the write is bounded by STATE_SAVE_TIMEOUT_SEC.
- _save_spawn_lock: Lock | None¶
Guards the check-and-set of _save_task so two concurrent callers cannot both observe _save_task as None/done() and spawn separate workers. Lazily created on first use so we don’t bind to a specific event loop at construction time.
- _save_task: Task[None] | None¶
Tracks the in-flight asyncio.to_thread task spawned by save_cache(). While this task is running (or hung on a stuck disk) subsequent calls to save_cache() join the existing task instead of spawning more threads, so a single hung disk write cannot exhaust the default thread pool. Each joiner gets its own STATE_SAVE_TIMEOUT_SEC budget, so brief overlap finishes successfully while a sustained hang produces independent timeout events on each subsequent request (which is what drives the mac_state_save_timeouts_total counter and the Slack-on-second-timeout rule).
- _state_dir: str¶
Path to the directory to save machine state in
- _state_path: str¶
Path to pickled state file
- async _user_is_authorized(user: User, slack: SlackHandler | None = None) bool¶
Return whether user is authorized for this machine.
- _user_is_second_authorized(user: User) bool¶
Return whether user holds any of the second-relay authorizations.
- current_amps: float¶
Last reported output ammeter reading (if equipped).
- display_text: str¶
Text currently displayed on the machine LCD screen
- internal_temperature_c: float | None¶
ESP32 internal temperature in °C
- is_locked_out: bool¶
Whether the machine is locked out from use.
- is_oopsed: bool¶
Whether the machine’s Oops button has been pressed.
- is_override_login: bool¶
Whether the machine is in an override login state
- last_checkin: float | None¶
Float timestamp of the machine’s last checkin time
- last_update: float | None¶
Float timestamp of the last time that machine state changed in a meaningful way, i.e. RFID value or Oops
- lockout() None¶
Lock-out the machine.
- property machine_response: Dict[str, str | bool | float | List[float]]¶
Return the response dict to send to the machine.
- oops(do_locking: bool = True) None¶
Oops the machine.
- relay_desired_state: bool¶
Whether the output relay should be on or not.
- rfid_present_since: float | None¶
Float timestamp when rfid_value last changed to a non-None value.
- rfid_value: str | None¶
Value of the RFID card/fob in use, or None if not present.
- async save_cache() None¶
Save machine state cache to disk with a timeout.
Single-flight per machine: only one save thread is outstanding at a time. Concurrent callers see the existing in-flight task and join it (awaiting the same task) rather than spawning a second thread that would also block on the same disk lock; this prevents thread-pool exhaustion under a sustained disk hang while heartbeats keep arriving.
Whether the caller spawned the task or joined an existing one, it then awaits with its own STATE_SAVE_TIMEOUT_SEC budget. Brief overlap (the existing save finishes within the joiner’s budget) returns success without counting a timeout. A sustained hang produces an independent timeout event on each request that exceeds its budget; the second such event triggers the Slack notification.
On timeout, the underlying thread is shielded and continues running (Python cannot cancel a thread blocked on file I/O); state_save_timeouts is incremented and StateSaveTimeoutError is raised.
- second_relay_authorization: str | None¶
Authorization decision outcome for the second relay (granted/denied/warn/always_enabled), or None if no second relay.
- second_relay_desired_state: bool¶
Whether the server wants the second relay energized.
- state_save_timeouts: int¶
Lifetime count of state-save timeouts for this machine. Persisted with the rest of the machine state (best-effort: a write that itself times out cannot persist the increment until the next successful save); surfaced as the mac_state_save_timeouts_total Prometheus counter from the in-memory value, which is always increment-correct because save_cache() is single-flight per machine.
- status_led_brightness: float¶
Status LED brightness value; float 0 to 1
- status_led_rgb: Tuple[float, float, float]¶
RGB values for status LED; floats 0 to 1
- unlock() None¶
Un-lock-out the machine.
- unoops(do_locking: bool = True) None¶
Un-oops the machine.
- async update(users: UsersConfig, oops: bool = False, rfid_value: str | None = None, uptime: float | None = None, wifi_signal_db: float | None = None, wifi_signal_percent: float | None = None, internal_temperature_c: float | None = None, amps: float | None = None, second_relay_state: bool | None = None) Dict[str, str | bool | float | List[float]]¶
Handle an update to the machine via API.
- uptime: float¶
Uptime of the machine’s ESP32 in seconds
- wifi_signal_db: float | None¶
ESP32 WiFi signal strength in dB
- wifi_signal_percent: float | None¶
ESP32 WiFi signal strength in percent
- class dm_mac.models.machine.MachinesConfig¶
Bases: object
Class representing machines configuration file.
- _load_and_validate_config() Dict[str, Dict[str, Any]]¶
Load and validate the config file.
- load_time: float¶
- static validate_config(config: Dict[str, Dict[str, Any]]) None¶
Validate configuration via jsonschema.
- dm_mac.models.machine.STATE_SAVE_TIMEOUT_SEC: float = 2.0¶
Maximum wall-clock seconds we will spend persisting machine state to disk before raising StateSaveTimeoutError. Keeps a single hung disk write from blocking the request handler long enough to wedge the firmware (see docs/2026-05-05-mcu-lockup-analysis.md).
- class dm_mac.models.machine.SecondRelayConfig(authorizations_or: List[str], unauthorized_warn_only: bool = False, always_enabled: bool = False, alias: str | None = None)¶
Bases: object
Authorization rules governing a machine’s second relay.
- alias: str | None¶
- always_enabled: bool¶
- property as_dict: Dict[str, Any]¶
Return a dict representation of this second relay config.
- authorizations_or: List[str]¶
- unauthorized_warn_only: bool¶
- exception dm_mac.models.machine.StateSaveTimeoutError¶
Bases: Exception
Raised when persisting machine state to disk exceeds the budget.
Surfaced to MCU clients as HTTP 503 by the /api/machine/update view (and by /api/machine/oops/<name> and /api/machine/locked_out/<name>) so the firmware sees a clean error and recovers on its next heartbeat.
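How a view might translate this exception into an HTTP 503 is sketched below in a framework-agnostic way. Everything here is an assumption for illustration: respond_to_update, hung_update, healthy_update, and the response body shape are hypothetical, and the real views use whatever web framework dm_mac is built on, which this reference does not show.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple


class StateSaveTimeoutError(Exception):
    """Stand-in for dm_mac.models.machine.StateSaveTimeoutError."""


async def respond_to_update(
    do_update: Callable[[], Awaitable[Dict[str, Any]]],
) -> Tuple[Dict[str, Any], int]:
    """Return (body, status); a save timeout becomes a 503 instead of a crash."""
    try:
        return await do_update(), 200
    except StateSaveTimeoutError as exc:
        # Hypothetical error body; the client is expected to retry on its
        # next heartbeat.
        return {"error": str(exc)}, 503


async def hung_update() -> Dict[str, Any]:
    # Simulates a machine update whose state save timed out.
    raise StateSaveTimeoutError("state save exceeded budget")


async def healthy_update() -> Dict[str, Any]:
    # Simulates a normal update returning a machine response dict.
    return {"relay": False}
```

Returning a well-formed 503 rather than letting the exception propagate keeps the failure mode explicit for the firmware, which is the behaviour the description above calls for.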