Software Engineer (Observability & Monitoring) - West Des Moines, IA
Job Description:
Overview:
· Seeking an experienced Observability and Monitoring Engineer to build and mature our enterprise-wide monitoring, logging, alerting, and observability capabilities across our AWS-based technology stack.
· This role will define the strategy, architecture, implementation standards, and dashboards that enable proactive detection, faster troubleshooting, and data-driven insights across applications, infrastructure, operating systems, databases, file transfers, and batch processes.
· The ideal candidate has hands-on engineering expertise, strong architecture skills, and the ability to unify multiple monitoring solutions into a cohesive observability framework.
Responsibilities:
· You will establish standards for logs, metrics, traces, event correlation, and alerting across multiple environments.
· You will build centralized dashboards and alerting policies that provide unified visibility across applications and services, operating systems, AWS services (EC2, RDS, Lambda, S3, CloudWatch, CloudTrail, etc.), databases (MS SQL Server, PostgreSQL, etc.), file transfer systems (SFTP, managed transfer tools), and batch jobs and scheduled processes.
· You will create actionable and noise-free alerting thresholds, escalation policies, and runbooks.
· You will integrate existing tools (Dynatrace, Graylog, Splunk, SolarWinds, Zabbix) into a cohesive ecosystem.
· You will rationalize tool usage and recommend consolidation or modernization where appropriate.
· You will manage the lifecycle, configuration, tuning, and health of monitoring and logging platforms, automate monitoring deployments using IaC (CloudFormation) and CI/CD pipelines, and develop reusable templates/standards so teams can onboard new applications quickly.
· You will build self-service dashboards and reporting for technical/business stakeholders, and create documentation for monitoring standards, dashboard naming conventions, logging schemas, and alert configuration guidelines.
· You will define SLOs/SLIs and reliability KPIs for critical services.
· You will partner with scrum, infrastructure, and security teams to reduce MTTR and improve system reliability, and participate in incident resolution, root cause analysis, and problem management.
· You will provide technical leadership/mentoring to team members and consult on architecture decisions and best practices.
· You will develop/maintain system documentation and participate in project planning and technical strategy sessions.
Qualifications:
· Bachelor's degree in Computer Science or related field
· 5+ years of experience implementing monitoring and observability using Dynatrace
· Hands-on experience with monitoring/logging tools such as Zabbix, Graylog, Splunk, SolarWinds, or equivalents
· 5+ years of hands-on experience with AWS services and architecture
· Deep understanding of metrics, logs, traces, distributed tracing, and event correlation
· Experience building dashboards and KPIs for application, infrastructure, and database layers
· Strong scripting/automation skills (Python, Bash, PowerShell) and familiarity with Terraform or CloudFormation
· Strong understanding of network monitoring, performance tuning, and systems architecture
· Familiarity with ITIL incident/problem management processes
· Proficiency with AI tools, and with using them responsibly to improve observability, preferred
· Experience with container orchestration and microservices architecture preferred
· Experience with AWS OpenTelemetry, Prometheus, Grafana, or similar tools preferred
Required Technical Skills:
• AWS Services (EC2, RDS, S3, Lambda, ECS/EKS, etc.)
• Configuration Management (Ansible, Puppet, Chef)
• Monitoring Tools (Dynatrace, CloudWatch, Zabbix, SolarWinds, Graylog, etc.)
• CI/CD Tools (Jenkins, Quickbuild, Bitbucket)
• Scripting Languages (Python, PowerShell, Bash)
• Database Management (MS SQL Server, PostgreSQL)
• Infrastructure as Code (Terraform, CloudFormation)
• Container Technologies (Docker, Kubernetes)