Site Reliability Engineer (Production Systems)
About the Role
A high-impact infrastructure initiative focused on improving the reliability and resilience of production systems is seeking experienced Site Reliability Engineers. The work helps advance AI systems designed to reason about real-world operational challenges, including system failures and infrastructure performance problems.
This opportunity is ideal for engineers with strong hands-on experience in production environments, particularly those who have operated high-availability systems and participated in on-call rotations. Candidates with a background in diagnosing complex outages and improving system observability are a strong fit.
The work involves designing and evaluating realistic incident scenarios, performing root cause analysis, and contributing to system reliability frameworks. Success in this role depends on deep operational insight and the ability to translate real-world engineering challenges into structured problem environments.
What You'll Do
- Design and document realistic production incident scenarios
- Perform detailed root cause analysis on simulated system failures
- Evaluate system behavior across monitoring and alerting frameworks
- Develop scenarios involving capacity planning and system scaling
- Review and refine incident response and post-mortem processes
- Analyze infrastructure reliability across distributed systems
- Collaborate on improving AI systems' understanding of operational best practices
Requirements
- 3+ years of experience in Site Reliability Engineering, DevOps, or production engineering
- Hands-on experience managing production systems with uptime and SLA requirements
- Direct involvement in on-call rotations and incident response workflows
- Strong experience conducting structured root cause analysis (RCA)
- Proficiency with observability and alerting tools such as Prometheus, Grafana, Datadog, or PagerDuty
- Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, load balancing)
- Experience with containerization and orchestration tools such as Docker and Kubernetes
- Familiarity with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
- Experience building or maintaining CI/CD pipelines
- Strong debugging skills across application and system layers
- Ability to work independently in a remote, asynchronous environment
- Must be based in the United States
- Preferred: Experience contributing to system design documentation or training datasets for AI systems