
Senior MLOps Engineer

XRI GLOBAL

Clarksville, TN
Full Time
Paid

    Benefits:

    Bonus based on performance

    Dental insurance

    Health insurance

    Paid time off

    Vision insurance

Reports to: Head of Machine Learning

Location: Clarksville, TN

Company: XRI Global, Inc.

About XRI

XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced language-technology solutions.

    About the Role

We’re seeking an experienced Senior MLOps Engineer to design, deploy, and maintain our next-generation AI infrastructure. In the near term, you’ll help us architect and implement a GPU-based compute cluster hosted in rented data-center racks. You’ll lead the end-to-end process of hardware selection, cluster provisioning, containerized workflows, and scalable training/inference pipelines for large-scale machine learning workloads.

    You’ll work closely with our research and platform teams to enable efficient, reproducible, and cost-effective AI model development and deployment at scale.

    Key Responsibilities

    Infrastructure Design & Buildout

    Evaluate GPU hardware and server configurations for cost/performance.

    Plan rack layout, power, networking, and cooling requirements in collaboration with data center providers.

    Lead provisioning, configuration, and monitoring of the new GPU cluster.

    Cluster & Systems Administration

    Set up and manage compute orchestration via Kubernetes, Slurm, or Ray across multiple GPU nodes.

    Implement high-availability networking and secure access for internal research workloads.

    Build automation for node setup (e.g., PXE boot, Ansible, Terraform).

    MLOps & Workflow Automation

    Develop CI/CD pipelines for model training and deployment (GitHub Actions, Argo, MLflow, etc.).

    Implement containerized environments (Docker, Podman) and job scheduling for large distributed training runs.

    Manage data lifecycle: versioning, validation, and storage optimization for multi-TB datasets.

    Monitoring, Logging, and Optimization

    Deploy observability tools (Prometheus, Grafana, Loki) to monitor GPU utilization, network throughput, and storage.

    Benchmark and optimize training and inference performance across frameworks (PyTorch, vLLM, DeepSpeed, FSDP, etc.).

    Design cost-reporting and utilization dashboards.

    Collaboration & Research Enablement

    Work with AI researchers to productionize new models.

    Maintain reproducible training environments for long-term experimentation.

    Contribute to model-deployment strategies (REST/gRPC, Triton, vLLM, FastAPI, etc.).

    Qualifications

    Required:

5+ years of experience in MLOps, HPC, or cloud infrastructure for ML.

    Strong Linux administration skills; experience with Ubuntu and NVIDIA drivers.

    Experience managing multi-GPU systems, CUDA, and containerized ML workloads.

    Familiarity with infrastructure-as-code tools (Terraform, Ansible).

    Experience with one or more orchestration systems (Kubernetes, Ray, Slurm).

    Deep understanding of distributed training (DDP, FSDP, DeepSpeed, etc.).

    Proficiency in Python and shell scripting.

    Preferred:

    Prior experience standing up GPU clusters or on-prem HPC environments.

    Familiarity with model deployment frameworks (vLLM, Triton Inference Server).

    Networking knowledge (InfiniBand, RoCE, NVLink, etc.).

    Experience setting up monitoring stacks (Prometheus, Grafana, Loki).

    Strong understanding of ML lifecycle tools (MLflow, DVC, Weights & Biases).

    Familiarity with security, access control, and GPU job scheduling policies.

    Nice to Have

    Background in large-model training, quantization, and inference optimization.

    Familiarity with hybrid on-prem + cloud bursting architectures.

    Interest in helping define long-term data-center strategy and procurement.

    Collaboration

Reports to the Head of ML; works with PMs and the broader engineering team.

Mentors colleagues and engages with cross-functional teams.

    KPIs

    Inference uptime, cost-efficiency, and latency.

    Tool adoption, incident resolution, and observability metrics.

    Language & Travel

    Fluent in English; additional languages a plus. Willing to travel internationally for field use, offsites, and events.

    Location & Compensation

    Location: Hybrid in Clarksville, TN

    Compensation: Competitive salary and eligibility for annual performance bonuses

    Culture and Values

    XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.

Flexible work-from-home options available.