Title: IT Operations Engineer (SRE)
Job Type: Contract
Location: Hybrid – Daytona Beach, FL or Plano, TX
Job Summary
The ideal candidate has experience leading root cause analysis in an enterprise environment and knows the major areas of IT systems, including networking, infrastructure (on-prem, hybrid, cloud), endpoints, data, and modern workplace platforms. The ideal candidate has also managed endpoints at the enterprise level, including policy management, patching, vulnerability management, observability, and their respective strategies. The ideal candidate is knowledgeable about Site Reliability Engineering best practices and has applied them, has identified areas for automation, and has a continuous improvement mindset.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 3+ years in a Site Reliability Engineering, Systems Engineering, or similar role.
- Strong experience with cloud platforms such as Azure, AWS, or GCP (Azure preferred).
- Proficient in at least one scripting or programming language, such as Python, Go, Bash, or PowerShell.
- Proficient in Power Automate and PowerApps.
- Experience with infrastructure as code tools such as Terraform or Ansible.
- Strong understanding of Linux systems, networking, and performance tuning.
- Experience with monitoring and observability tools such as Azure Monitor, Zabbix, Grafana, Datadog, Dynatrace, LogicMonitor, or ControlUp.
Preferred:
- Familiarity with ITIL/ITSM processes and incident/change management systems.
- Background in security best practices such as least privilege access, secure configurations, and patching.
- Experience supporting large-scale or distributed systems in production.
- Knowledge of FinOps or cost optimization in cloud environments.
- Hands-on experience with containerization and orchestration tools such as Docker or Kubernetes.
- Systems administration experience, including applying best practices, optimization, and vendor management.
Description and Responsibilities
- Ensure reliability and uptime of production systems through monitoring, incident response, and capacity planning.
- Develop and maintain automated solutions for configuration, deployment, monitoring, and alerting/self-healing.
- Work with application and infrastructure teams to design resilient and scalable systems.
- Participate in on-call rotations, respond to incidents, and perform root cause analysis.
- Define and track SLIs, SLOs, and SLAs, using data to drive operational decisions.
- Continuously improve system performance, cost efficiency, and observability.
- Collaborate with developers to integrate reliability and security best practices into the software development lifecycle.
- Document processes, runbooks, and architectural decisions.
Eligibility: All applicants currently authorized to live and work in the United States on a permanent basis are welcome to apply. Applicants must currently reside in the US. Sponsorship is not available for this position.
Wright Technical Services and our client are Equal Opportunity Employers. We celebrate diversity and are committed to creating an inclusive environment for all employees. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status.