
baldur_pro.services.emergency_mode — Emergency Mode

System-wide emergency levels and the manager that activates them: EmergencyLevel, get_emergency_manager, and is_emergency_active.

🔒 PRO Feature — requires a baldur-pro license

These symbols ship in the baldur-pro distribution. PRO modules import normally — there is no ImportError. PRO features activate only when baldur.init() runs with a valid BALDUR_LICENSE_KEY; without it the system runs with OSS defaults and register_pro_services() logs entitlement.pro_registration_skipped.

emergency_mode

Emergency Mode Service — Single-instance emergency lifecycle.

Manages emergency mode activation/deactivation, traffic tier rules (Load Shedding), and gradual recovery for a single service instance.

For multi-region namespace-scoped emergency management (regional isolation, cascade detection, partition reconciliation), see regional_emergency/.

Architecture

models.emergency.EmergencyLevel (shared domain type)
│
├── emergency_mode/        ← this package (single-instance lifecycle)
└── regional_emergency/    (multi-region extension, depends on this)

Features:

- EmergencyLevel: emergency levels (NORMAL, LEVEL_1, LEVEL_2, LEVEL_3)
- GracefulDegradationManager: emergency mode activation/deactivation
- RecoveryGate: metric-based recovery stabilization and gradual recovery

Usage

from baldur_pro.services.emergency_mode import (
    get_emergency_manager,
    EmergencyLevel,
    is_emergency_active,
)

Check emergency status

if is_emergency_active():
    level = get_emergency_manager().get_current_level()
    ...

Activate emergency mode

get_emergency_manager().activate_manual(
    level=EmergencyLevel.LEVEL_2,
    reason="High error rate detected",
    activated_by="admin",
    duration_minutes=30,
)

Deactivate emergency mode

get_emergency_manager().deactivate(deactivated_by="admin")

Status: Public

RecoveryGateConfig dataclass

RecoveryGateConfig(
    stabilization_period_seconds: int = 300,
    require_metrics_stable: bool = True,
    cpu_threshold_percent: float = 80.0,
    error_rate_threshold: float = 0.05,
    gradual_recovery: bool = True,
    level_step_delay_seconds: int = 60,
    health_check_interval_seconds: int = 30,
    auto_rollback_on_failure: bool = True,
)

Bases: SerializableMixin

Stabilization-window configuration for safe emergency exit.

Defines how long the system must remain stable before the recovery gate releases emergency mode, plus the metric thresholds that count as "stable". Defaults are read from EmergencyModeSettings when available.

stabilization_period_seconds class-attribute instance-attribute

stabilization_period_seconds: int = 300

Stabilization wait window (seconds).

require_metrics_stable class-attribute instance-attribute

require_metrics_stable: bool = True

Whether metric-based stability checks are required.

cpu_threshold_percent class-attribute instance-attribute

cpu_threshold_percent: float = 80.0

CPU usage must be at or below this to count as stable.

error_rate_threshold class-attribute instance-attribute

error_rate_threshold: float = 0.05

Error rate must be at or below this (5%) to count as stable.
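The two thresholds above can be read as an inclusive comparison against the current metrics. The following is a minimal sketch under that reading, not the library's implementation: `GateThresholds` and `metrics_stable` are hypothetical stand-ins, and the metric keys `cpu_percent` and `error_rate` are borrowed from the `metrics_checker` example elsewhere on this page.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GateThresholds:
    """Hypothetical stand-in mirroring two RecoveryGateConfig fields."""

    cpu_threshold_percent: float = 80.0
    error_rate_threshold: float = 0.05


def metrics_stable(cfg: GateThresholds, metrics: Dict[str, float]) -> bool:
    # "At or below" is inclusive, so a value equal to the threshold passes.
    return (
        metrics["cpu_percent"] <= cfg.cpu_threshold_percent
        and metrics["error_rate"] <= cfg.error_rate_threshold
    )


cfg = GateThresholds()
calm = metrics_stable(cfg, {"cpu_percent": 75.0, "error_rate": 0.02})
edge = metrics_stable(cfg, {"cpu_percent": 80.0, "error_rate": 0.05})
hot = metrics_stable(cfg, {"cpu_percent": 92.5, "error_rate": 0.02})
```

In this sketch `calm` and `edge` count as stable, while `hot` does not because CPU usage exceeds the threshold.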

gradual_recovery class-attribute instance-attribute

gradual_recovery: bool = True

Whether to step down emergency level gradually.

level_step_delay_seconds class-attribute instance-attribute

level_step_delay_seconds: int = 60

Delay between level-down steps (seconds).

health_check_interval_seconds class-attribute instance-attribute

health_check_interval_seconds: int = 30

Metric re-check cadence during recovery (seconds).

auto_rollback_on_failure class-attribute instance-attribute

auto_rollback_on_failure: bool = True

Whether to roll back automatically when recovery fails.

from_settings classmethod

from_settings() -> RecoveryGateConfig

Create from EmergencyModeSettings, with hardcoded fallback.
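One way to picture the "settings with hardcoded fallback" pattern is shown below. This is an illustrative sketch only: `GateConfigSketch` is a hypothetical two-field stand-in, the real `from_settings()` takes no arguments, and how it locates EmergencyModeSettings is not described here, so the settings object is passed in explicitly.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace
from typing import Optional


@dataclass
class GateConfigSketch:
    """Hypothetical stand-in with two of the documented fields."""

    stabilization_period_seconds: int = 300
    level_step_delay_seconds: int = 60

    @classmethod
    def from_settings(cls, settings: Optional[object] = None) -> "GateConfigSketch":
        # Copy each field from the settings object when it is present;
        # otherwise keep the hardcoded dataclass default.
        kwargs = {}
        for f in fields(cls):
            value = getattr(settings, f.name, None)
            if value is not None:
                kwargs[f.name] = value
        return cls(**kwargs)


# A settings object that overrides only one field (assumed shape).
partial_settings = SimpleNamespace(stabilization_period_seconds=120)
cfg = GateConfigSketch.from_settings(partial_settings)
fallback = GateConfigSketch.from_settings(None)
```

Here `cfg` picks up the overridden window and keeps the default step delay, while `fallback` equals a default-constructed config.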

EmergencyLevel

Bases: str, Enum

Emergency level definitions.

Each level determines per-tier traffic multipliers. Ordering: NORMAL < LEVEL_1 < LEVEL_2 < LEVEL_3 (severity-based).

severity property

severity: int

Numeric severity for ordering comparisons and backward compatibility.

from_severity classmethod

from_severity(severity: int) -> EmergencyLevel

Create EmergencyLevel from numeric severity (0-3).

Supports legacy integer-based code that used IntEnum values.
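The relationship between a string-valued enum and a numeric severity can be sketched as follows. `LevelSketch` is a hypothetical stand-in: the member string values are assumptions (the source does not list them), and the real class may implement `severity` differently.

```python
from enum import Enum


class LevelSketch(str, Enum):
    # Member string values are assumptions for illustration only.
    NORMAL = "normal"
    LEVEL_1 = "level_1"
    LEVEL_2 = "level_2"
    LEVEL_3 = "level_3"

    @property
    def severity(self) -> int:
        # Definition order doubles as severity order: NORMAL=0 ... LEVEL_3=3.
        return list(type(self)).index(self)

    @classmethod
    def from_severity(cls, severity: int) -> "LevelSketch":
        members = list(cls)
        if not 0 <= severity < len(members):
            raise ValueError(f"severity must be 0-3, got {severity}")
        return members[severity]


restored = LevelSketch.from_severity(2)
worst = max(
    [LevelSketch.LEVEL_1, LevelSketch.LEVEL_3, LevelSketch.NORMAL],
    key=lambda lv: lv.severity,
)
```

Comparing by `severity` keeps the ordering explicit. With the assumed values in this sketch, comparing the raw strings would sort lexicographically and place "normal" after "level_3".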

EmergencyModeError

EmergencyModeError(message: str = '', *, code: str = '')

Bases: BaldurError

Base exception for emergency mode operations.

EmergencyStateError

EmergencyStateError(
    message: str, *, operation: str = "", detail: str = ""
)

Bases: EmergencyModeError

Invalid state or input for emergency mode operation.

Covers: missing parameters, inactive emergency mode, duplicate recovery, invalid target level.

RecoveryNotAllowedError

RecoveryNotAllowedError(
    message: str, *, check_reason: str = ""
)

Bases: EmergencyModeError

Recovery gate policy prevents deactivation.

GracefulDegradationManager

Bases: EventEmitterMixin

Emergency-mode manager: handles stepwise entry into and exit from emergency mode.

Implemented as a thread-safe singleton.
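A common way to build a thread-safe singleton in Python is double-checked locking in `__new__`. The sketch below shows that general technique with a hypothetical `SingletonSketch` class; whether GracefulDegradationManager uses exactly this mechanism is an assumption.

```python
import threading


class SingletonSketch:
    """Hypothetical class illustrating double-checked locking."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Fast path without the lock; re-check under the lock so that two
        # racing threads cannot both construct an instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance


def _grab(out: list) -> None:
    out.append(SingletonSketch())


seen: list = []
threads = [threading.Thread(target=_grab, args=(seen,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread ends up holding the same object, which is why calling the constructor directly and calling an accessor function can return the same manager.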

Usage

manager = GracefulDegradationManager()

Manually activate emergency mode

manager.activate_manual(
    level=EmergencyLevel.LEVEL_2,
    reason="High error rate",
    activated_by="admin",
    duration_minutes=30,
)

Check the current state

state = manager.get_state()

Deactivate emergency mode

manager.deactivate(deactivated_by="admin")

close

close() -> None

Unsubscribe EventBus handlers.

Idempotent: safe to call multiple times.

get_state

get_state() -> EmergencyState

Get the current state.

get_current_level

get_current_level() -> EmergencyLevel

Get the current emergency-mode level.

is_active

is_active() -> bool

Whether emergency mode is active.

get_previous_states

get_previous_states() -> list[dict[str, Any]]

Get the list of previous-state snapshots.

Returns:

- list[dict[str, Any]]: Snapshot list (newest first).

rollback_to_previous

rollback_to_previous(
    index: int = 0,
) -> EmergencyState | None

Roll back to a previous state.

Parameters:

- index (int, default: 0): Snapshot index to roll back to (0 = most recent, 1 = the one before it, and so on).

Returns:

- EmergencyState | None: The rolled-back state, or None on failure.
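The newest-first indexing used by `get_previous_states` and `rollback_to_previous` can be illustrated with a bounded deque. This is a sketch under assumptions: `SnapshotHistory`, its `max_snapshots` limit, and the dict-shaped snapshots are all hypothetical, and this version does not consume snapshots on rollback.

```python
from collections import deque
from typing import Optional


class SnapshotHistory:
    """Hypothetical newest-first snapshot store."""

    def __init__(self, max_snapshots: int = 10) -> None:
        # maxlen drops the oldest entry (right end) once the store is full.
        self._snapshots: deque = deque(maxlen=max_snapshots)

    def push(self, state: dict) -> None:
        # appendleft keeps index 0 as the most recent snapshot.
        self._snapshots.appendleft(dict(state))

    def rollback(self, index: int = 0) -> Optional[dict]:
        if not 0 <= index < len(self._snapshots):
            return None  # mirrors the documented "None on failure"
        return dict(self._snapshots[index])


history = SnapshotHistory()
history.push({"level": "level_1"})
history.push({"level": "level_2"})
latest = history.rollback()
older = history.rollback(1)
missing = history.rollback(5)
```

Index 0 returns the snapshot pushed last, index 1 the one before it, and an out-of-range index yields None rather than raising.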

get_tier_multiplier

get_tier_multiplier(tier_id: str) -> float

Return the tier multiplier for the current emergency-mode level.

Parameters:

- tier_id (str, required): Tier ID (critical, standard, non_essential).

Returns:

- float: Multiplier between 0.0 and 1.0.
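A per-level, per-tier lookup table is one straightforward way to back such a function. The sketch below is illustrative only: the multiplier values, the string level keys, and the fallback for unknown tiers are assumptions, not the library's defaults.

```python
# Multiplier values are illustrative assumptions, not the library's defaults.
TIER_MULTIPLIERS = {
    "normal": {"critical": 1.0, "standard": 1.0, "non_essential": 1.0},
    "level_1": {"critical": 1.0, "standard": 0.8, "non_essential": 0.5},
    "level_2": {"critical": 1.0, "standard": 0.5, "non_essential": 0.1},
    "level_3": {"critical": 1.0, "standard": 0.2, "non_essential": 0.0},
}


def tier_multiplier(level: str, tier_id: str) -> float:
    # Unknown tiers fall back to 1.0 (no shedding) in this sketch.
    value = TIER_MULTIPLIERS[level].get(tier_id, 1.0)
    # Clamp so callers can always treat the result as a probability.
    return min(1.0, max(0.0, value))


standard_at_l2 = tier_multiplier("level_2", "standard")
non_essential_at_l3 = tier_multiplier("level_3", "non_essential")
unknown_tier = tier_multiplier("normal", "batch")
```

Keeping critical traffic at 1.0 across all levels reflects the usual load-shedding intent of protecting the most important requests first.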

activate_manual

activate_manual(
    level: EmergencyLevel,
    reason: str,
    activated_by: str,
    duration_minutes: int | None = None,
    is_chaos_experiment: bool = False,
    experiment_id: str | None = None,
    override_kill_switch: bool = False,
) -> EmergencyState

Manually activate emergency mode.

Parameters:

- level (EmergencyLevel, required): Emergency-mode level.
- reason (str, required): Activation reason.
- activated_by (str, required): User who activated it.
- duration_minutes (int | None, default: None): Auto-expiry time in minutes; None requires manual deactivation.
- is_chaos_experiment (bool, default: False): Whether the activation comes from a chaos experiment.
- experiment_id (str | None, default: None): Related chaos experiment ID.
- override_kill_switch (bool, default: False): True to allow activation when the kill switch is ON.

Returns:

- EmergencyState: The new state.
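The effect of `duration_minutes` on the `expires_at` field of the resulting state can be sketched as below. `compute_expires_at` is a hypothetical helper, and the ISO 8601 string format is an assumption based only on `expires_at` being typed as `str | None`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expires_at(
    activated_at: datetime, duration_minutes: Optional[int]
) -> Optional[str]:
    # None means "no auto-expiry": the mode stays on until deactivated.
    if duration_minutes is None:
        return None
    return (activated_at + timedelta(minutes=duration_minutes)).isoformat()


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timed = compute_expires_at(start, 30)
open_ended = compute_expires_at(start, None)
```

With a 30-minute duration the sketch yields an expiry half an hour after activation; with None there is no expiry timestamp at all.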

activate_auto

activate_auto(
    level: EmergencyLevel,
    reason: str,
    duration_minutes: int | None = None,
) -> EmergencyState

Automatically activate emergency mode (system-triggered).

Automated activations respect the kill switch unconditionally.

Parameters:

- level (EmergencyLevel, required): Emergency-mode level.
- reason (str, required): Auto-detection reason.
- duration_minutes (int | None, default: None): Auto-expiry time in minutes (default 30 minutes).

Returns:

- EmergencyState: The new state.
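The difference between manual and automated activation with respect to the kill switch can be summarized as a small decision function. This is a sketch of the documented policy, assuming that "respecting" the kill switch means refusing activation while it is ON; `activation_allowed` is a hypothetical name.

```python
def activation_allowed(
    kill_switch_on: bool,
    *,
    automated: bool,
    override_kill_switch: bool = False,
) -> bool:
    if not kill_switch_on:
        return True
    # Kill switch is ON: automated triggers are always refused,
    # manual ones pass only with an explicit override.
    return (not automated) and override_kill_switch


auto_when_off = activation_allowed(False, automated=True)
auto_with_override = activation_allowed(
    True, automated=True, override_kill_switch=True
)
manual_with_override = activation_allowed(
    True, automated=False, override_kill_switch=True
)
manual_without_override = activation_allowed(True, automated=False)
```

Only the manual path with an explicit override gets through while the kill switch is ON, which matches `activate_auto` having no override parameter.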

deactivate

deactivate(
    deactivated_by: str,
    reason: str = "",
    force: bool = False,
) -> EmergencyState

Deactivate emergency mode.

Parameters:

- deactivated_by (str, required): User who deactivated it.
- reason (str, default: ''): Deactivation reason (optional).
- force (bool, default: False): Force deactivation, ignoring recovery conditions.

Returns:

- EmergencyState: The new state.
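How `force` might interact with the recovery gate is sketched below. This rests on an assumption: that a refused gate check surfaces as RecoveryNotAllowedError unless `force=True`. `RecoveryBlocked` and `deactivate_sketch` are hypothetical stand-ins that mirror the documented signatures, not the library's code.

```python
from typing import Tuple


class RecoveryBlocked(Exception):
    """Hypothetical stand-in for RecoveryNotAllowedError."""

    def __init__(self, message: str, *, check_reason: str = "") -> None:
        super().__init__(message)
        self.check_reason = check_reason


def deactivate_sketch(gate_result: Tuple[bool, str], force: bool = False) -> str:
    allowed, reason = gate_result
    # force skips the gate verdict entirely, as the parameter docs describe.
    if not allowed and not force:
        raise RecoveryBlocked(
            "recovery gate refused deactivation", check_reason=reason
        )
    return "deactivated"


gate_says_no = (False, "error_rate above threshold")
try:
    outcome = deactivate_sketch(gate_says_no)
except RecoveryBlocked as exc:
    outcome = f"blocked: {exc.check_reason}"

forced = deactivate_sketch(gate_says_no, force=True)
```

Callers that catch the exception can surface `check_reason` to an operator, who can then decide whether forcing deactivation is justified.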

start_gradual_recovery

start_gradual_recovery(
    initiated_by: str,
    target_level: EmergencyLevel = EmergencyLevel.NORMAL,
) -> EmergencyState

Start gradual recovery.

Eases stepwise from the current level down to the target level. Proceeds after a metrics check at each step.

Parameters:

- initiated_by (str, required): User who started the recovery.
- target_level (EmergencyLevel, default: NORMAL): Target level.

Returns:

- EmergencyState: The current state.
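The stepwise easing described for gradual recovery, with a metrics check before each step, can be sketched as a simple loop. This is a simplified model: levels are plain strings, the `metrics_ok` callback is hypothetical, and the delay between steps (`level_step_delay_seconds`) is omitted so the example runs instantly.

```python
from typing import Callable, List

LEVELS = ["normal", "level_1", "level_2", "level_3"]


def gradual_recovery(
    current: str, target: str, metrics_ok: Callable[[], bool]
) -> List[str]:
    """Return the levels visited while stepping down one level at a time."""
    visited = []
    idx, stop = LEVELS.index(current), LEVELS.index(target)
    while idx > stop:
        if not metrics_ok():
            break  # hold at the current level; a real gate might roll back
        idx -= 1
        visited.append(LEVELS[idx])
    return visited


full = gradual_recovery("level_3", "normal", lambda: True)

checks = iter([True, False])
stalled = gradual_recovery("level_3", "normal", lambda: next(checks))

partial = gradual_recovery("level_3", "level_1", lambda: True)
```

In this model `full` walks through every intermediate level, `stalled` stops after one step because the second metrics check fails, and `partial` stops at the requested target level.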

stop_gradual_recovery

stop_gradual_recovery(
    stopped_by: str, reason: str = ""
) -> EmergencyState

Stop gradual recovery.

get_history

get_history(limit: int = 50) -> list[dict[str, Any]]

Get the change history.

set_recovery_gate_config

set_recovery_gate_config(
    config: RecoveryGateConfig, changed_by: str = "system"
)

Update the recovery gate configuration.

get_recovery_gate_config

get_recovery_gate_config() -> RecoveryGateConfig

Get the current recovery gate configuration.

reset

reset()

Reset state (for tests).

EmergencyState dataclass

EmergencyState(
    level: EmergencyLevel = EmergencyLevel.NORMAL,
    is_active: bool = False,
    activated_at: str | None = None,
    activated_by: str | None = None,
    activation_reason: str | None = None,
    expires_at: str | None = None,
    deactivated_at: str | None = None,
    deactivated_by: str | None = None,
    is_auto_triggered: bool = False,
    is_recovering: bool = False,
    recovery_started_at: str | None = None,
    target_level: EmergencyLevel | None = None,
    metadata: dict[str, Any] | None = None,
)

Bases: SerializableMixin

Emergency-mode state.

metadata class-attribute instance-attribute

metadata: dict[str, Any] | None = None

Additional metadata.

When activated by a chaos experiment

{
    "is_chaos_experiment": True,
    "experiment_id": "exp-xxx",
    "classification": "chaos_induced_test"
}

RecoveryGate

RecoveryGate(
    config: RecoveryGateConfig | None = None,
    metrics_checker: (
        Callable[[], dict[str, float]] | None
    ) = None,
)

Recovery gate: manages safe deactivation of emergency mode.

Features:

- Metric-based stability checks
- Gradual recovery (stepwise easing per level)
- Automatic rollback on recovery failure

Parameters:

- config (RecoveryGateConfig | None, default: None): Recovery gate configuration.
- metrics_checker (Callable[[], dict[str, float]] | None, default: None): Callback returning the current system metrics, for example {"cpu_percent": 75.0, "error_rate": 0.02}.
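A `metrics_checker` is just a zero-argument callable that returns a dict of floats. The sketch below builds one from request counters and an injected CPU probe; `RequestCounters`, `make_metrics_checker`, and `cpu_probe` are hypothetical names, and only the `cpu_percent` and `error_rate` keys come from the example return value above.

```python
from typing import Callable, Dict


class RequestCounters:
    """Hypothetical in-process request/error counters."""

    def __init__(self) -> None:
        self.total = 0
        self.errors = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if not ok:
            self.errors += 1


def make_metrics_checker(
    counters: RequestCounters, cpu_probe: Callable[[], float]
) -> Callable[[], Dict[str, float]]:
    def check() -> Dict[str, float]:
        # Avoid dividing by zero before any request has been recorded.
        error_rate = counters.errors / counters.total if counters.total else 0.0
        return {"cpu_percent": cpu_probe(), "error_rate": error_rate}

    return check


counters = RequestCounters()
for ok in [True] * 49 + [False]:
    counters.record(ok)

# cpu_probe is a fixed stub here; a real one would query the host.
checker = make_metrics_checker(counters, cpu_probe=lambda: 75.0)
sample = checker()
```

A callable shaped like `checker` is what the `metrics_checker` parameter expects, so it could be passed as `RecoveryGate(metrics_checker=checker)`.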

check_recovery_allowed

check_recovery_allowed() -> tuple[bool, str]

Check whether recovery is allowed.

Returns:

- tuple[bool, str]: (is_allowed, reason), that is, whether recovery is allowed and why.
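The `(is_allowed, reason)` return shape pairs naturally with a stabilization window. The sketch below assumes the gate tracks the time since metrics last became stable; `check_recovery_sketch`, its parameters, and the reason strings are hypothetical and chosen only to illustrate the pattern.

```python
from typing import Optional, Tuple


def check_recovery_sketch(
    stable_since: Optional[float],
    now: float,
    stabilization_period_seconds: int = 300,
) -> Tuple[bool, str]:
    # stable_since is None while metrics are still above their thresholds.
    if stable_since is None:
        return False, "metrics are not stable"
    elapsed = now - stable_since
    if elapsed < stabilization_period_seconds:
        remaining = stabilization_period_seconds - elapsed
        return False, f"stabilizing: {remaining:.0f}s remaining"
    return True, "stabilization window elapsed"


unstable = check_recovery_sketch(None, now=1000.0)
waiting = check_recovery_sketch(900.0, now=1000.0)
ready = check_recovery_sketch(600.0, now=1000.0)
```

Returning the reason alongside the boolean lets callers log or display why recovery is still blocked without a second call.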

get_next_recovery_level

get_next_recovery_level(
    current_level: EmergencyLevel,
) -> EmergencyLevel | None

Return the next level in gradual recovery.

LEVEL_3 -> LEVEL_2 -> LEVEL_1 -> NORMAL

Returns:

- EmergencyLevel | None: The next level, or None when already NORMAL.
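The step-down chain above maps directly onto a lookup table. This is a minimal sketch using plain strings in place of EmergencyLevel members; `STEP_DOWN` and `next_recovery_level` are hypothetical names.

```python
from typing import Optional

# One entry per documented step: LEVEL_3 -> LEVEL_2 -> LEVEL_1 -> NORMAL.
STEP_DOWN = {"level_3": "level_2", "level_2": "level_1", "level_1": "normal"}


def next_recovery_level(current: str) -> Optional[str]:
    # "normal" has no entry, so the lookup returns None at the end of the chain.
    return STEP_DOWN.get(current)


chain = []
level: Optional[str] = "level_3"
while level is not None:
    chain.append(level)
    level = next_recovery_level(level)
```

Walking the table from the highest level visits each level once and terminates when the lookup returns None.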

get_emergency_manager

get_emergency_manager() -> GracefulDegradationManager

Get the emergency-mode manager singleton.

reset_emergency_manager

reset_emergency_manager() -> None

Reset the emergency manager (clears both module-level and class-level singletons, joins recovery thread).

is_emergency_active

is_emergency_active() -> bool

Check whether emergency mode is active (convenience function).

Usage

if is_emergency_active():
    # Handle emergency mode
    ...

get_emergency_level

get_emergency_level() -> EmergencyLevel

Get the current emergency-mode level (convenience function).

Usage

level = get_emergency_level()
if level >= EmergencyLevel.LEVEL_2:
    # Handle a severe emergency
    ...

get_tier_multiplier

get_tier_multiplier(tier_id: str) -> float

Get the tier multiplier for the current emergency mode (convenience function).

Usage

multiplier = get_tier_multiplier("standard")
if random.random() > multiplier:
    # Load shedding
    return Response(status=503)
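The usage snippet above treats the multiplier as an admission probability. The sketch below isolates that decision so it can be exercised with a seeded generator; `should_shed` is a hypothetical helper, and it uses `>=` rather than `>` so that a multiplier of 0.0 sheds every request, including the rare case where the generator returns exactly 0.0.

```python
import random


def should_shed(multiplier: float, rng: random.Random) -> bool:
    # rng.random() is uniform on [0.0, 1.0), so a request is admitted with
    # probability `multiplier` and shed otherwise.
    return rng.random() >= multiplier


rng = random.Random(42)  # seeded only to make the example repeatable
trials = 10_000
shed_fraction = sum(should_shed(0.5, rng) for _ in range(trials)) / trials

never_shed = not any(should_shed(1.0, rng) for _ in range(1_000))
always_shed = all(should_shed(0.0, rng) for _ in range(1_000))
```

With a multiplier of 0.5 roughly half of the requests are shed, while 1.0 admits everything and 0.0 rejects everything.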