SITE RELIABILITY ENGINEER
San Francisco, CA or New York, NY | Full-time | In-Office
\-------------------------------------------------------------
COMPENSATION
Base: $200,000–$300,000
Equity: Competitive
\-------------------------------------------------------------
WHY THIS ROLE
This clinical AI company works with dozens of the nation's leading health systems and helps millions of patients get faster access to medication each year, so reliability on this platform is directly tied to patient outcomes. Well funded and 75 people strong, the company has the engineering maturity of a much larger team: a thoughtful tech stack (Kubernetes, Terraform, OpenTelemetry, Honeycomb), clear SLO-driven operations, and an SRE function that ships code, not just configuration. At this stage, your architectural decisions carry company-wide weight. This is a role for an application-leaning SRE who wants real ownership, not a team of 500 to hide in.
\-------------------------------------------------------------
ABOUT THE ROLE
You'll own the full production environment and improve the development experience by strengthening both infrastructure and application reliability for a clinical AI platform. This is not a pure infra role: you'll be expected to contribute performance and reliability improvements directly to application code, lead incident response with a bias toward durable fixes, and drive SLO-aligned engineering outcomes. Details on reporting structure and team composition are shared during the interview process.
\-------------------------------------------------------------
REQUIREMENTS
Experience: 7+ years as a highly technical, application-leaning Site Reliability Engineer
\- Prior experience operating deployments of 500+ machines
\- Deep expertise in Kubernetes and Helm (deployment, scaling, operational health)
\- CI/CD optimization across TypeScript and Python/ML pipelines
\- Infrastructure as Code using Terraform
\- Ability to define, implement, and evolve SLIs and SLOs
\- Standardization of OpenTelemetry traces, metrics, and events/logs
\- Performance and scalability diagnosis from trace and metrics data
\- Strong incident response skills with a bias toward durable, code-level fixes
\- Comfortable contributing reliability improvements directly to application code
Tech Stack: Python, TypeScript, Kubernetes, Terraform, OpenTelemetry, Honeycomb
Visa Sponsorship: Available for all types except net new H-1Bs
\-------------------------------------------------------------
LOGISTICS
Location: San Francisco, CA (Financial District) or New York, NY (Midtown)
Work Policy: In-office, 5 days per week