baldur.services — Control API & Metrics
This page covers the control-API request/response surface for runtime
operations and the metrics helpers. The metrics symbols resolve lazily and
require the `[prometheus]` extra at runtime.
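Because the metrics symbols resolve lazily, a missing extra only surfaces when a symbol is first imported. The sketch below shows one way a caller might guard against that. It assumes `record_sla_breach` is importable from `baldur.services` and that the failure surfaces as an `ImportError`, which is what a `from ... import ...` statement raises when a name cannot be resolved. The no-op fallback is a hypothetical choice, not part of baldur.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Resolved lazily; assumed to fail with ImportError when the
    # [prometheus] extra is not installed.
    from baldur.services import record_sla_breach
except ImportError:
    logger.warning("baldur metrics unavailable; SLA breaches will not be recorded")

    def record_sla_breach(domain: str) -> None:
        """Hypothetical no-op fallback with the documented signature."""
        return None


# Safe to call whether or not the metrics backend is present.
record_sla_breach("payment")
```

Whether a silent fallback is acceptable depends on the deployment; in production it may be preferable to let the import error propagate.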
Control API
ControlAPIService
ControlAPIService()
Baldur Control API Service.
Provides a unified, auditable, reversible, and governed control surface to manage reliability behaviors across testing, chaos experimentation, and real production operations.
Usage
    service = ControlAPIService()

    # Execute a control action
    response = service.execute(
        ControlRequest(
            service_name="payment",
            action="allow",
            reason="PG recovered",
            environment="ops",
        )
    )

    # Get current status
    status = service.get_status(environment="ops")

    # Get audit logs
    logs = service.get_audit_logs(service_name="payment")

Initialize the Control API Service.
execute
execute(request: ControlRequest) -> ControlResponse
Execute a control API action.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `request` | `ControlRequest` | Control request | required |
Returns:
| Type | Description |
|---|---|
| `ControlResponse` | ControlResponse with outcome |
get_status
get_status(environment: str = 'ops') -> dict
Get the current status of all services.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `environment` | `str` | Current environment context | `'ops'` |
Returns:
| Type | Description |
|---|---|
| `dict` | Status dictionary with all service states |
get_service_status
get_service_status(service_name: str) -> dict
Get the status of a specific service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `service_name` | `str` | Service to check | required |
Returns:
| Type | Description |
|---|---|
| `dict` | Service state dictionary |
is_failure_injection_active
is_failure_injection_active(service_name: str) -> bool
Check if failure injection is active for a service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `service_name` | `str` | Service to check | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if failures should be injected |
get_failure_injection_config
get_failure_injection_config(
service_name: str,
) -> dict | None
Get failure injection configuration for a service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `service_name` | `str` | Service to check | required |
Returns:
| Type | Description |
|---|---|
| `dict \| None` | Configuration dict or None |
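The two failure-injection helpers are naturally used together: check whether injection is active, then read the configuration. The following is a minimal sketch of that pattern, not baldur's own code. `StubControlService` is a stand-in that exposes only the two documented methods, and the `error_message` key is hypothetical, since the configuration schema is not documented here.

```python
from __future__ import annotations


class StubControlService:
    """Stand-in exposing only the two documented failure-injection methods."""

    def __init__(self, configs: dict[str, dict]) -> None:
        self._configs = configs

    def is_failure_injection_active(self, service_name: str) -> bool:
        return service_name in self._configs

    def get_failure_injection_config(self, service_name: str) -> dict | None:
        return self._configs.get(service_name)


def call_with_injection(service, service_name: str, call):
    """Run `call`, raising a simulated failure first if injection is active."""
    if service.is_failure_injection_active(service_name):
        # The config may be None, so fall back to an empty dict.
        config = service.get_failure_injection_config(service_name) or {}
        # "error_message" is a hypothetical key used for illustration only.
        raise RuntimeError(config.get("error_message", "injected failure"))
    return call()


service = StubControlService({"payment": {"error_message": "simulated PG outage"}})
print(call_with_injection(service, "inventory", lambda: "ok"))  # prints "ok"
try:
    call_with_injection(service, "payment", lambda: "ok")
except RuntimeError as exc:
    print(f"injected: {exc}")  # prints "injected: simulated PG outage"
```

Handling the `None` return explicitly keeps the caller safe if the configuration is cleared between the two calls.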
get_metrics
get_metrics() -> dict
Collect comprehensive baldur metrics for trend analysis.
Returns operational metrics for dashboards, AI agents, and monitoring. Unlike status (point-in-time snapshot), metrics provide trend data.
Consumers:

- Admin UI: Dashboard visualization
- AI Agent: Automated decision making
- Prometheus/Grafana: Metrics scraping
- External Monitoring: Alerting integration
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary with comprehensive metrics data |
ControlRequest
dataclass
ControlRequest(
service_name: str,
action: str,
reason: str,
environment: str,
ttl_minutes: int | None = None,
request_id: str = (lambda: str(uuid.uuid4()))(),
metadata: dict = dict(),
actor: str = "system",
actor_role: str = "automation",
)
Internal representation of a control API request.
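The rendered defaults `(lambda: str(uuid.uuid4()))()` and `dict()` can read as values evaluated once and shared. Signatures rendered this way usually come from `dataclasses.field(default_factory=...)`; assuming that holds here, every request gets its own `request_id` and its own `metadata` dict. The stand-in below mirrors the documented fields to illustrate that behavior; it is not baldur's source, and the `ticket` value is made up.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class ControlRequestSketch:
    """Stand-in mirroring the documented ControlRequest fields."""

    service_name: str
    action: str
    reason: str
    environment: str
    ttl_minutes: int | None = None
    # Assumed to be default factories, so each instance gets fresh values.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)
    actor: str = "system"
    actor_role: str = "automation"


first = ControlRequestSketch("payment", "allow", "PG recovered", "ops")
second = ControlRequestSketch("payment", "allow", "PG recovered", "ops", ttl_minutes=30)

# Mutating one instance's metadata does not leak into the other.
first.metadata["ticket"] = "hypothetical-123"
print(first.request_id != second.request_id)  # True
print(second.metadata)  # {}
```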
ControlResponse
dataclass
ControlResponse(
status: str,
action_applied: str,
system_state: str = "",
effective_until: str | None = None,
reason_classification: str = "",
evidence: dict = dict(),
correlation_id: str = (lambda: str(uuid.uuid4()))(),
error_code: str = "",
error_message: str = "",
risk_level: str = "",
)
Bases: SerializableMixin
Internal representation of a control API response.
Metrics
record_sla_breach
record_sla_breach(domain: str) -> None
Record an SLA breach event.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `domain` | `str` | Business domain where breach occurred | required |
collect_all_metrics
collect_all_metrics() -> dict
Collect all baldur metrics.
This should be called by a periodic Celery task.
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary with all current metric values |
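As a rough illustration of the periodic-task advice, the sketch below pairs a task body with an entry in the shape of Celery's `beat_schedule` setting. It is kept free of Celery imports so it runs on its own: `collect_all_metrics` here is a stand-in returning sample data, and the task path `myapp.tasks.collect_baldur_metrics` and the 60-second interval are assumptions, not values taken from baldur.

```python
import logging

logger = logging.getLogger(__name__)


def collect_all_metrics() -> dict:
    """Stand-in for baldur's collect_all_metrics(); returns sample values."""
    return {"dlq_pending_count": 3}


def collect_baldur_metrics() -> int:
    """Task body; in a Celery app this would be registered with @app.task."""
    metrics = collect_all_metrics()
    logger.info("collected %d baldur metrics", len(metrics))
    return len(metrics)


# Entry in the shape Celery's beat_schedule setting expects.
# The task path and the interval are assumptions for illustration.
beat_schedule = {
    "collect-baldur-metrics": {
        "task": "myapp.tasks.collect_baldur_metrics",
        "schedule": 60.0,  # seconds between runs
    },
}

print(collect_baldur_metrics())  # prints 1
```

In a real application the dictionary would be assigned to `app.conf.beat_schedule`, and the interval chosen to match how often the metrics backend is scraped.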
ALERTING_RULES
module-attribute
ALERTING_RULES: dict = {
"DLQPendingHigh": {
"expr": "dlq_pending_count > 10",
"for": "5m",
"severity": "warning",
"team": "ops",
"summary": "DLQ pending count is high",
"description": "More than 10 items pending in DLQ for domain {{ $labels.domain }}",
"runbook_url": "https://docs.internal/runbooks/dlq-pending-high",
},
"DLQPendingCritical": {
"expr": "dlq_pending_count > 50",
"for": "5m",
"severity": "critical",
"team": "ops",
"summary": "DLQ pending count is critical",
"description": "More than 50 items pending in DLQ for domain {{ $labels.domain }}",
"runbook_url": "https://docs.internal/runbooks/dlq-pending-critical",
},
"DLQGrowthRateHigh": {
"expr": "rate(dlq_created_total[5m]) > 5",
"for": "5m",
"severity": "warning",
"team": "ops",
"summary": "DLQ growth rate is high",
"description": "More than 5 new DLQ items per minute for domain {{ $labels.domain }}",
"runbook_url": "https://docs.internal/runbooks/dlq-growth-high",
},
"RetrySuccessRateLow": {
"expr": "retry_success_rate < 70",
"for": "15m",
"severity": "warning",
"team": "dev",
"summary": "Retry success rate is low",
"description": "Retry success rate below 70% for domain {{ $labels.domain }}",
"runbook_url": "https://docs.internal/runbooks/retry-success-low",
},
"CircuitBreakerOpen": {
"expr": "circuit_breaker_state == 1",
"for": "1m",
"severity": "critical",
"team": "ops",
"summary": "Circuit breaker is open",
"description": "Circuit breaker for {{ $labels.service }} is in OPEN state",
"runbook_url": "https://docs.internal/runbooks/circuit-breaker-open",
},
"CircuitBreakerOpenLong": {
"expr": "circuit_breaker_state == 1",
"for": "10m",
"severity": "critical",
"team": "ops",
"summary": "Circuit breaker open for extended period",
"description": "Circuit breaker for {{ $labels.service }} has been open for more than 10 minutes",
"runbook_url": "https://docs.internal/runbooks/circuit-breaker-extended",
},
"SLABreachDetected": {
"expr": "increase(sla_breach_total[1h]) > 0",
"for": "0m",
"severity": "warning",
"team": "ops",
"summary": "SLA breach detected",
"description": "SLA breach detected for domain {{ $labels.domain }}",
"runbook_url": "https://docs.internal/runbooks/sla-breach",
},
"RecoveryTimeSlow": {
"expr": "histogram_quantile(0.95, rate(recovery_time_seconds_bucket[1h])) > 1800",
"for": "15m",
"severity": "warning",
"team": "ops",
"summary": "Recovery time P95 is slow",
"description": "95th percentile recovery time exceeds 30 minutes",
"runbook_url": "https://docs.internal/runbooks/recovery-slow",
},
"HumanReviewQueueLong": {
"expr": "histogram_quantile(0.95, rate(human_review_queue_time_seconds_bucket[1h])) > 3600",
"for": "30m",
"severity": "warning",
"team": "ops",
"summary": "Human review queue time is high",
"description": "Items waiting more than 1 hour for human review",
"runbook_url": "https://docs.internal/runbooks/review-queue-long",
},
"ReplayFailureRateHigh": {
"expr": "sum(rate(replay_outcomes_total{outcome='failure'}[1h])) / sum(rate(replay_outcomes_total[1h])) > 0.5",
"for": "15m",
"severity": "warning",
"team": "dev",
"summary": "Replay failure rate is high",
"description": "More than 50% of replay attempts are failing",
"runbook_url": "https://docs.internal/runbooks/replay-failure-high",
},
"ErrorBudgetCritical": {
"expr": 'error_budget_remaining_percent{tier="critical"} < 10 or error_budget_remaining_percent{tier="standard"} < 20 or error_budget_remaining_percent{tier="non_essential"} < 30',
"for": "5m",
"severity": "critical",
"team": "ops",
"summary": "Error budget critical - tier-aware deployment freeze",
"description": "Error budget remaining is {{ $value }}% (tier={{ $labels.tier }}, region={{ $labels.region }}). Deployment freeze is recommended.",
"runbook_url": "https://docs.internal/runbooks/error-budget-critical",
},
"ErrorBudgetWarning": {
"expr": 'error_budget_remaining_percent{tier="critical"} < 30 or error_budget_remaining_percent{tier="standard"} < 50 or error_budget_remaining_percent{tier="non_essential"} < 60',
"for": "10m",
"severity": "warning",
"team": "ops",
"summary": "Error budget warning - tier-aware",
"description": "Error budget remaining is {{ $value }}% (tier={{ $labels.tier }}, region={{ $labels.region }}). Consider reducing deployments.",
"runbook_url": "https://docs.internal/runbooks/error-budget-warning",
},
"ErrorBudgetFastBurn": {
"expr": "error_budget_burn_rate_1h > 14.4",
"for": "5m",
"severity": "critical",
"team": "ops",
"summary": "Fast error budget burn detected",
"description": "1-hour burn rate is {{ $value }}x. Consuming 2%+ budget per hour.",
"runbook_url": "https://docs.internal/runbooks/error-budget-fast-burn",
},
"ErrorBudgetSlowBurn": {
"expr": "error_budget_burn_rate_6h > 3",
"for": "30m",
"severity": "warning",
"team": "ops",
"summary": "Slow error budget burn detected",
"description": "6-hour burn rate is {{ $value }}x. Sustained elevated error rate.",
"runbook_url": "https://docs.internal/runbooks/error-budget-slow-burn",
},
"DeploymentFreezeActive": {
"expr": "deployment_freeze_status >= 3",
"for": "0m",
"severity": "info",
"team": "ops",
"summary": "Deployment freeze is active",
"description": "Deployment freeze is recommended or in effect.",
"runbook_url": "https://docs.internal/runbooks/deployment-freeze",
},
"FailSafeTriggered": {
"expr": "increase(baldur_failsafe_triggered_total[5m]) > 0",
"for": "0m",
"severity": "critical",
"team": "ops",
"summary": "🚨 Baldur Fail-Safe mode activated",
"description": "Baldur system component '{{ $labels.component }}' has failed and Fail-Safe mode is active. Deployments are proceeding but system needs immediate attention.",
"runbook_url": "https://docs.internal/runbooks/baldur-failsafe",
},
"FailSafeModeActive": {
"expr": "baldur_failsafe_mode_active == 1",
"for": "2m",
"severity": "critical",
"team": "ops",
"summary": "🚨 Baldur in degraded mode",
"description": "Baldur '{{ $labels.component }}' is operating in Fail-Safe mode. Error Budget recommendations are not available. Investigate and restore normal operation immediately.",
"runbook_url": "https://docs.internal/runbooks/baldur-failsafe",
},
"BaldurServiceDead": {
"expr": "time() - baldur_heartbeat_timestamp_seconds > 120",
"for": "0m",
"severity": "critical",
"team": "ops",
"summary": "🔴 Baldur service is DEAD",
"description": "No heartbeat received from Baldur '{{ $labels.component }}' for more than 2 minutes. The service may have crashed or is unresponsive. This is a critical infrastructure failure.",
"runbook_url": "https://docs.internal/runbooks/baldur-dead",
},
"BaldurHeartbeatMissing": {
"expr": "absent(baldur_heartbeat_timestamp_seconds) == 1",
"for": "5m",
"severity": "critical",
"team": "ops",
"summary": "🔴 Baldur heartbeat metric missing",
"description": "The Baldur heartbeat metric is completely absent. The service may never have started or is not properly initialized.",
"runbook_url": "https://docs.internal/runbooks/baldur-missing",
},
"OverrideEscalation": {
"expr": "increase(baldur_override_escalation_total[1h]) > 0",
"for": "0m",
"severity": "warning",
"team": "ops",
"summary": "⚠️ Deployment override escalation",
"description": "A deployment override of type '{{ $labels.override_type }}' was approved despite insufficient error budget. This action requires governance review.",
"runbook_url": "https://docs.internal/runbooks/override-escalation",
},
"OverrideEscalationHigh": {
"expr": "increase(baldur_override_escalation_total[24h]) > 5",
"for": "0m",
"severity": "critical",
"team": "ops",
"summary": "🚨 Excessive deployment overrides",
"description": "More than 5 deployment overrides in the last 24 hours. This may indicate process issues or sustained reliability problems.",
"runbook_url": "https://docs.internal/runbooks/override-escalation-high",
},
"XTestCrossRegionDeniedRateHigh": {
"expr": "rate(baldur_xtest_cross_region_denied_total[1m]) > 10",
"for": "1m",
"severity": "warning",
"team": "security",
"summary": "⚠️ High rate of cross-region X-Test denials",
"description": "Cross-region X-Test denial rate exceeds 10/min. Current region: {{ $labels.current_region }}, Target region: {{ $labels.target_region }}. This may indicate misconfigured clients or attempted cross-region access.",
"runbook_url": "https://docs.internal/runbooks/xtest-cross-region-denied",
},
"XTestCrossRegionDeniedFromSameSource": {
"expr": "sum by (current_region, target_region) (increase(baldur_xtest_cross_region_denied_total[5m])) > 5",
"for": "0m",
"severity": "warning",
"team": "security",
"summary": "🔒 Repeated cross-region X-Test denials detected",
"description": "More than 5 cross-region denials in 5 minutes from the same source. This may indicate a security issue or misconfigured automation. Investigate the source of these requests immediately.",
"runbook_url": "https://docs.internal/runbooks/xtest-cross-region-repeated",
},
"TierStarvationNonEssential": {
"expr": '(rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) / (rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) + rate(baldur_rate_controller_processed_total{tier="non_essential"}[5m])) > 0.99) and ((rate(baldur_rate_controller_dropped_total{tier="non_essential"}[5m]) + rate(baldur_rate_controller_processed_total{tier="non_essential"}[5m])) > 10)',
"for": "10m",
"severity": "warning",
"team": "ops",
"summary": "non_essential tier rejecting 99%+ — starvation suspected",
"description": "Over the last 5 minutes, 99%+ of non_essential requests were rejected out of >10 total. If processed_by_tier_total is near 0, this is full starvation. Check the backpressure level and watermark settings.",
"runbook_url": "https://docs.internal/runbooks/tier-starvation",
},
}
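Each entry in `ALERTING_RULES` carries the pieces of a Prometheus alerting rule. The sketch below shows one possible way to turn the dictionary into a rule-group structure. It uses two entries copied from above so it runs on its own, and the choice to map `severity` and `team` to labels and the remaining text fields to annotations is an assumption, not something baldur is documented to do.

```python
import json

# Two entries copied from ALERTING_RULES so the sketch is self-contained.
SAMPLE_RULES = {
    "DLQPendingHigh": {
        "expr": "dlq_pending_count > 10",
        "for": "5m",
        "severity": "warning",
        "team": "ops",
        "summary": "DLQ pending count is high",
        "description": "More than 10 items pending in DLQ for domain {{ $labels.domain }}",
        "runbook_url": "https://docs.internal/runbooks/dlq-pending-high",
    },
    "CircuitBreakerOpen": {
        "expr": "circuit_breaker_state == 1",
        "for": "1m",
        "severity": "critical",
        "team": "ops",
        "summary": "Circuit breaker is open",
        "description": "Circuit breaker for {{ $labels.service }} is in OPEN state",
        "runbook_url": "https://docs.internal/runbooks/circuit-breaker-open",
    },
}


def to_rule_group(rules: dict, group_name: str = "baldur") -> dict:
    """Map each rule entry onto a Prometheus alerting-rule object."""
    alerts = []
    for name, rule in rules.items():
        alerts.append({
            "alert": name,
            "expr": rule["expr"],
            "for": rule["for"],
            # Assumed mapping: routing fields become labels.
            "labels": {"severity": rule["severity"], "team": rule["team"]},
            "annotations": {
                "summary": rule["summary"],
                "description": rule["description"],
                "runbook_url": rule["runbook_url"],
            },
        })
    return {"groups": [{"name": group_name, "rules": alerts}]}


# Print the rule group as JSON, which YAML-based tooling generally accepts.
print(json.dumps(to_rule_group(SAMPLE_RULES), indent=2, ensure_ascii=False))
```

When reviewing thresholds, keep in mind that PromQL `rate()` returns a per-second average, so an expression such as `rate(dlq_created_total[5m]) > 5` fires above 5 events per second.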