Senior/Staff Site Reliability Engineer - San Francisco

IKR Enterprises

San Francisco, CA
Full Time
Paid
  • Responsibilities

    Job Title: Senior / Staff Site Reliability Engineer
    Location: San Francisco or New York City (Hybrid, 3+ days/week in-office)
    Compensation: $211,000 – $248,000 base salary + competitive equity

    About the Company
    High-growth, AI-focused healthcare technology company building secure, cloud-native products used by clinicians and health systems. The team cares about reliability, performance, and enabling engineers to move fast without breaking things.

    Role Overview
    We’re hiring a Senior / Staff Site Reliability Engineer to focus on application performance, reliability, and platform scale. This role sits between backend engineering and platform/SRE: you will design and run load and chaos tests, use observability and profiling tools to find bottlenecks, and then drive the code and infrastructure changes needed to fix them.

    You’ll often embed with product teams for weeks or months at a time, helping them migrate services to more scalable infrastructure and adopt better SLOs, error budgets, and incident practices as the platform grows.

    What You’ll Do

    • Use load testing, chaos engineering, and other test practices to identify performance and latency issues across services, and fix them in application code

    • Drive software changes that move applications to more scalable infrastructure (runtimes, event-driven systems, databases, multi-tenant setups)

    • Tune application and infrastructure configuration to improve performance and scalability

    • Build internal tools and modules that help engineers ship more safely, more quickly, and with better defaults

    • Work with the Platform team to shape and roll out elements of the internal developer platform (service templates, self-serve infrastructure, etc.)

    • Partner with application teams to define and adopt SLOs, error budgets, and health metrics that support canary releases and better monitoring

    • Improve incident response by strengthening observability, dashboards, runbooks, and on-call practices

    • Document patterns, run training, and coach teams on cloud-native and performance best practices

    • Represent the work in the broader platform/SRE community (talks, OSS, etc.) as interest and time allow

    Tech Stack

    • Kubernetes, Terraform, GCP

    • Python, TypeScript

    • Datadog, Grafana, and related observability tools

    Requirements

    • 5–10 years of experience as a backend or platform/SRE engineer working on distributed systems or developer tooling

    • At least 2 years focused on system performance and scalability at the application layer

    • Experience improving reliability and scalability of production systems (e.g., service migrations, database performance work, resilience improvements)

    • Proven examples of reducing latency by multiples using observability and profiling tools

    • Hands-on experience building on Kubernetes and scaling compute services on Kubernetes

    • Experience with at least one major cloud provider (GCP preferred)

    • Strong skills in Python and TypeScript (or strong in one and willing to ramp on the other)

    • Experience using Datadog, Grafana, or similar tools for monitoring and alerting

    • Clear interest in reliability, scalability, and engineering enablement (not just product features)

    • Willingness and ability to work in person at least 3 days per week in San Francisco or New York City

    What We’re Looking For (Green Flags)

    • Core skill: application-layer performance optimization with real examples of cutting latency by multiples

    • Domain expertise: Kubernetes + Python/TypeScript + distributed systems scaling experience

    • Scale experience: has supported systems during rapid growth or major traffic spikes

    • Success pattern: has migrated production systems, improved database performance, or built internal developer tooling that changed how teams work

    Traits That Are Not a Fit

    • Mainly interested in backend product features rather than reliability and platform work

    • Multiple short tenures (less than ~2 years in most recent roles)

    • High ego or low interest in cross-team collaboration

    • Fewer than 5–6 total years of software engineering experience

    If you like solving hard performance problems, improving reliability at scale, and raising the bar for how teams ship software, we’d like to hear from you.