Senior Site Reliability Engineer I

American Express

Phoenix, AZ

Full Time

Paid

Responsibilities
JOB DESCRIPTION

Joining Amex Tech means discovering and shaping your contribution to something big. Here, you can work alongside talented tech teams and build a unique career with the Powerful Backing of American Express. With a range of opportunities to work with the latest technologies, and a commitment to back the broader engineering community through open source, our mission is to power your success. Because Amex Tech is powered by our technology, our culture, and our colleagues.

The Senior Site Reliability Engineer I develops and implements Site Reliability Engineering (SRE) strategies, ensures real-time system observability, promotes best practices and automation, and collaborates with cross-functional teams to enhance system reliability and customer experiences while mentoring junior engineers.

RESPONSIBILITIES

Mentors junior Site Reliability Engineers and cross-functional teams of colleagues, fostering a culture of excellence and innovation

Provides guidance and support to junior engineers, fostering professional growth and development within the team, ensuring adherence to best practices in Site Reliability Engineering

Manages and oversees collaboration with Software Engineering teams to design, develop, and implement advanced features that enhance system resilience, scalability, and performance, proactively identifying and resolving complex system bottlenecks and failure points

Leads the development and refinement of sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline complex operational workflows, deployment processes, and infrastructure management, significantly reducing manual intervention and ensuring high system efficiency

Actively engages in and influences high-level architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are deeply integrated into strategic decision-making processes, and driving the adoption of innovative solutions

Designs, executes, and oversees comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement improvements that strengthen system robustness and recovery capabilities, and mentors colleagues in these practices

Leads the development, optimization, and maintenance of comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions

Advocates for and implements advanced observability practices, including error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability, and mentoring colleagues in these practices

Collaborates with cross-functional teams to enhance customer journeys, ensuring seamless and reliable technology experiences by addressing potential reliability and performance issues proactively, and leading initiatives to improve overall system reliability

Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

QUALIFICATIONS

Education Qualifications:

Bachelor’s degree in Computer Science, Information Technology, Engineering, and/or comparable experience; advanced degree preferred

8+ years of experience in software engineering and application development with strong proficiency in Java/J2EE, Python, Kotlin, Spring Boot, SQL, and NoSQL

Knowledge of a modern observability stack (e.g., Splunk, Elasticsearch, Prometheus, Grafana)

Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture

Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms

Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud

Work Experience:

Experience in software development or technology operations, with a focus on Site Reliability Engineering

Experience in Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)

Licenses and Certifications:

Advanced certification in Site Reliability Engineering (SRE) or related is a plus

Employment eligibility to work with American Express in the United States is required as the company will not pursue visa sponsorship for these positions.
Industry
Financial Services