About the Role

A high-impact infrastructure initiative focused on improving the reliability and resilience of production-grade systems is seeking experienced Site Reliability Engineers. This work contributes to advancing AI systems designed to reason about real-world operational challenges, including system failures and infrastructure performance.

This opportunity is ideal for individuals with strong hands-on experience in production environments, especially those who have operated high-availability systems and participated in on-call rotations. Candidates with a background in diagnosing complex outages and improving system observability are a strong fit.

The work involves designing and evaluating realistic incident scenarios, performing root cause analysis, and contributing to system reliability frameworks. Success in this role depends on deep operational insight and the ability to translate real-world engineering challenges into structured problem environments.

What You'll Do

Design and document realistic production incident scenarios
Perform detailed root cause analysis on simulated system failures
Evaluate system behavior across monitoring and alerting frameworks
Develop scenarios involving capacity planning and system scaling
Review and refine incident response and post-mortem processes
Analyze infrastructure reliability across distributed systems
Collaborate on improving AI understanding of operational best practices

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or production engineering
Hands-on experience managing production systems with uptime and SLA requirements
Direct involvement in on-call rotations and incident response workflows
Strong experience conducting structured root cause analysis (RCA)
Proficiency with observability and alerting tools such as Prometheus, Grafana, Datadog, or PagerDuty
Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, load balancing)
Experience with containerization and orchestration tools such as Kubernetes and Docker
Familiarity with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
Experience building or maintaining CI/CD pipelines
Strong debugging skills across application and system layers
Ability to work independently in a remote, asynchronous environment
Must be based in the United States
Preferred: Experience contributing to system design documentation or training datasets for AI systems

Site Reliability Engineer (Production Systems)
