Provides site reliability engineering expertise for building and maintaining highly available, scalable, and resilient systems. Specializes in SLOs, error budgets, incident management, chaos engineering, capacity planning, and observability platforms, with a focus on reliability, availability, and performance.
expr: |
  (
    sum(rate(http_requests_total{service="auth",code=~"2.."}[5m]))
    /
    sum(rate(http_requests_total{service="auth"}[5m]))
  ) < 0.999
for: 5m
labels:
  severity: critical
  service: auth
annotations:
  summary: "Auth service availability below SLO"
  description: "Current availability: {{ $value | humanizePercentage }}"
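To make the 0.999 threshold concrete, the sketch below converts an availability target into an error budget. The 30-day window, the function names, and the treatment of zero traffic are assumptions for illustration; the rule itself only states the 99.9% target.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget tolerates over the window.

    The 30-day default is an assumption, not something the rule specifies.
    """
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def availability(good_requests: float, total_requests: float) -> float:
    """Ratio of successful to total requests, mirroring the alert expression."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic is treated as meeting the SLO
    return good_requests / total_requests
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of full outage, and a window with 9,985 successes out of 10,000 requests (99.85%) would fall below the threshold.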
expr: |
  (
    1 - (
      sum(rate(http_requests_total{service="auth",code=~"2.."}[1h]))
      /
      sum(rate(http_requests_total{service="auth"}[1h]))
    )
  ) > 14.4 * (1 - 0.999)  # 2% of monthly budget in 1 hour
for: 5m
labels:
  severity: critical
  service: auth
annotations:
  summary: "Auth service burning error budget at 14.4x rate"
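The 14.4 multiplier and the "2% of monthly budget in 1 hour" remark are linked by simple arithmetic, sketched here. The function names and the 30-day budget period are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than the sustainable pace the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for `window_hours`.

    The 30-day period is an assumption matching the "monthly" wording.
    """
    return rate * window_hours / (period_days * 24)


# Error ratio that trips the alert for a 99.9% SLO: about 0.0144 (1.44% of requests failing).
threshold = 14.4 * (1 - 0.999)

# One hour at 14.4x out of a 720-hour month spends 2% of the budget.
print(round(budget_consumed(14.4, 1), 4))  # 0.02
```

A burn rate of 1 would spend the budget exactly over the full period, so sustaining 14.4x would exhaust a 30-day budget in about 50 hours.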
Site reliability expert specializing in SLOs, error budgets, and reliability engineering practices. Experienced in incident management, post-incident analysis, capacity planning, and building scalable, fault-tolerant systems with a focus on reliability, availability, and performance. Source: 404kidwiz/claude-supercode-skills.