Senior/Staff Site Reliability Engineer - San Francisco
Job Title: Senior / Staff Site Reliability Engineer
Location: San Francisco or New York City (Hybrid, 3+ days/week in-office)
Compensation: $211,000 – $248,000 base salary + competitive equity
About the Company
High-growth, AI-focused healthcare technology company building secure, cloud-native products used by clinicians and health systems. The team cares about reliability, performance, and enabling engineers to move fast without breaking things.
Role Overview
We’re hiring a Senior / Staff Site Reliability Engineer to focus on application performance, reliability, and platform scale. This role sits between backend engineering and platform/SRE: you will design and run load and chaos tests, use observability and profiling tools to find bottlenecks, and then drive the code and infrastructure changes needed to fix them.
You’ll often embed with product teams for weeks or months at a time, helping them migrate services to more scalable infrastructure and adopt better SLOs, error budgets, and incident practices as the platform grows.
What You’ll Do
Use load testing, chaos engineering, and other test practices to identify performance and latency issues across services, and fix them in application code
Drive software changes that move applications to more scalable infrastructure (runtimes, event-driven systems, databases, multi-tenant setups)
Tune application and infrastructure configuration to improve performance and scalability
Build internal tools and modules that help engineers ship faster and more safely, with better defaults
Work with the Platform team to shape and roll out elements of the internal developer platform (service templates, self-serve infrastructure, etc.)
Partner with application teams to define and adopt SLOs, error budgets, and health metrics that support canary releases and better monitoring
Improve incident response by strengthening observability, dashboards, runbooks, and on-call practices
Document patterns, run training, and coach teams on cloud-native and performance best practices
Represent the work in the broader platform/SRE community (talks, OSS, etc.) as interest and time allow
Tech Stack
Kubernetes, Terraform, GCP
Python, TypeScript
Datadog, Grafana, and related observability tools
Requirements
5–10 years of experience as a backend or platform/SRE engineer working on distributed systems or developer tooling
At least 2 years focused on system performance and scalability at the application layer
Experience improving reliability and scalability of production systems (e.g., service migrations, database performance work, resilience improvements)
Proven examples of reducing latency by multiples using observability and profiling tools
Hands-on experience building and scaling compute services on Kubernetes
Experience with at least one major cloud provider (GCP preferred)
Strong skills in Python and TypeScript (or strong in one and willing to ramp on the other)
Experience using Datadog, Grafana, or similar tools for monitoring and alerting
Clear interest in reliability, scalability, and engineering enablement (not just product features)
Willingness and ability to work in person at least 3 days per week in San Francisco or New York City
What We’re Looking For (Green Flags)
Core skill: application-layer performance optimization with real examples of cutting latency by multiples
Domain expertise: Kubernetes + Python/TypeScript + distributed systems scaling experience
Scale experience: has supported systems during rapid growth or major traffic spikes
Success pattern: has migrated production systems, improved database performance, or built internal developer tooling that changed how teams work
Traits That Are Not a Fit
Mainly interested in backend product features rather than reliability and platform work
Multiple short tenures (less than ~2 years in most recent roles)
High ego or low interest in cross-team collaboration
Fewer than 5 total years of software engineering experience
If you like solving hard performance problems, improving reliability at scale, and raising the bar for how teams ship software, we’d like to hear from you.