
Senior MLOps Engineer

XRI GLOBAL

Clarksville, TN
Full Time
Paid

    Benefits:

    Bonus based on performance

    Dental insurance

    Health insurance

    Paid time off

    Vision insurance

Reports to: Head of Machine Learning

Location: Clarksville, TN

Company: XRI Global, Inc.

About XRI

XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced language-technology solutions.

    About the Role

We’re seeking an experienced Senior MLOps Engineer to design, deploy, and maintain our next-generation AI infrastructure. In the near term, you’ll help us architect and implement a GPU-based compute cluster hosted in rented data-center racks. You’ll lead the end-to-end process of hardware selection, cluster provisioning, containerized workflows, and scalable training/inference pipelines for large-scale machine learning workloads.

    You’ll work closely with our research and platform teams to enable efficient, reproducible, and cost-effective AI model development and deployment at scale.

    Key Responsibilities

    Infrastructure Design & Buildout

    Evaluate GPU hardware and server configurations for cost/performance.

    Plan rack layout, power, networking, and cooling requirements in collaboration with data center providers.

    Lead provisioning, configuration, and monitoring of the new GPU cluster.

    Cluster & Systems Administration

    Set up and manage compute orchestration via Kubernetes, Slurm, or Ray across multiple GPU nodes.

    Implement high-availability networking and secure access for internal research workloads.

    Build automation for node setup (e.g., PXE boot, Ansible, Terraform).

    MLOps & Workflow Automation

    Develop CI/CD pipelines for model training and deployment (GitHub Actions, Argo, MLflow, etc.).

    Implement containerized environments (Docker, Podman) and job scheduling for large distributed training runs.

    Manage data lifecycle: versioning, validation, and storage optimization for multi-TB datasets.

    Monitoring, Logging, and Optimization

    Deploy observability tools (Prometheus, Grafana, Loki) to monitor GPU utilization, network throughput, and storage.

    Benchmark and optimize training and inference performance across frameworks (PyTorch, vLLM, DeepSpeed, FSDP, etc.).

    Design cost-reporting and utilization dashboards.

    Collaboration & Research Enablement

    Work with AI researchers to productionize new models.

    Maintain reproducible training environments for long-term experimentation.

    Contribute to model-deployment strategies (REST/gRPC, Triton, vLLM, FastAPI, etc.).

    Qualifications

    Required:

5+ years of experience in MLOps, HPC, or cloud infrastructure for ML.

    Strong Linux administration skills; experience with Ubuntu and NVIDIA drivers.

    Experience managing multi-GPU systems, CUDA, and containerized ML workloads.

    Familiarity with infrastructure-as-code tools (Terraform, Ansible).

    Experience with one or more orchestration systems (Kubernetes, Ray, Slurm).

    Deep understanding of distributed training (DDP, FSDP, DeepSpeed, etc.).

    Proficiency in Python and shell scripting.

    Preferred:

    Prior experience standing up GPU clusters or on-prem HPC environments.

    Familiarity with model deployment frameworks (vLLM, Triton Inference Server).

    Networking knowledge (InfiniBand, RoCE, NVLink, etc.).

    Experience setting up monitoring stacks (Prometheus, Grafana, Loki).

    Strong understanding of ML lifecycle tools (MLflow, DVC, Weights & Biases).

    Familiarity with security, access control, and GPU job scheduling policies.

    Nice to Have

    Background in large-model training, quantization, and inference optimization.

    Familiarity with hybrid on-prem + cloud bursting architectures.

    Interest in helping define long-term data-center strategy and procurement.

    Collaboration

Reports to the Head of ML; works with PMs and the broader engineering team.

Mentors colleagues and engages with cross-functional teams.

    KPIs

    Inference uptime, cost-efficiency, and latency.

    Tool adoption, incident resolution, and observability metrics.

    Language & Travel

    Fluent in English; additional languages a plus. Willing to travel internationally for field use, offsites, and events.

    Location & Compensation

    Location: Hybrid in Clarksville, TN

    Compensation: Competitive salary and eligibility for annual performance bonuses

    Culture and Values

    XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.

Flexible work-from-home options available.