Senior MLOps Engineer

XRI GLOBAL

Clarksville, TN
Full Time
Paid

    Benefits:

    401(k)

    401(k) matching

    Competitive salary

    Paid time off

    Parental leave

    Job Title: Senior MLOps Engineer

    About XRI

    XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced language technology solutions.

    About the role

    We’re looking for the engineer who will own our entire AI backend stack—design decisions, uptime, roadmap, and results. You’ll inherit a production Kubernetes inference cluster (GPU-backed AKS), a Python/Flask API that handles authentication, Stripe billing, and API-key metering, and a growing ML R&D pipeline for new models. The mandate is broad: keep everything humming today while architecting what it should look like 12 months from now. We’ll give you budget, support, and plenty of autonomy—then count on you to drive impact.

    Responsibilities

    Maintain and scale AI backend (AKS, PyTriton, Flask APIs).

    Develop APIs with auth, billing, and usage tracking.

    Enable distributed training on Azure ML with PyTorch DDP/FSDP.

    Build tools for research experimentation and internal efficiency.

    Implement CI/CD with GitHub Actions, Docker, and Terraform.

    Lead observability with Prometheus, Grafana, and SLOs.

    Drive architecture decisions and roadmap execution.

    Qualifications

    Master’s degree in CS, ML, or related field.

    5+ years of experience developing backend infrastructure.

    Proven expertise with Azure Kubernetes Service (AKS) and backend infrastructure.

    Demonstrated proficiency in PyTorch, Python, Flask, Stripe integration, and RESTful APIs.

    Skills

    Advanced Python (3.10+), with production experience using Flask or FastAPI.

    Experience with CI/CD pipelines using GitHub Actions and Docker.

    Familiarity with monitoring tools (Prometheus, Grafana).

    Working knowledge of Terraform, API security best practices, and OpenAPI specifications.

    Bonus: Experience with Redis, ReactJS, TypeScript, and Kafka.

    Goals and Objectives

    Short-term: Improve ML systems, set data standards, stabilize the backend, and build CI/CD pipelines.

    Long-term: Develop scalable infrastructure and crowd-sourced data systems, and support internal adoption.

    Collaboration

    Reports to the Head of ML and works with PMs and the broader engineering team.

    Mentors colleagues and engages in cross-functional teams.

    KPIs

    Inference uptime, cost-efficiency, and latency.

    Tool adoption, incident resolution, and observability metrics.

    Language & Travel

    Fluent in English; additional languages are a plus. Willing to travel internationally for field work, offsites, and events.

    Culture and Values

    XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.