Job Description
Level III Site Reliability Engineers are recognized technical experts who lead complex projects and initiatives, drive innovation, and serve as key resources for both their team and the broader organization. They operate with significant autonomy, solve complex technical problems, and influence technical strategy and process improvements. Level III engineers are expected to mentor others, lead cross-functional efforts, and proactively identify opportunities to enhance reliability, scalability, and efficiency.
Key Responsibilities
- Technical Leadership & Strategy: Lead the design, implementation, and optimization of reliability engineering solutions for mission-critical systems. Serve as a technical advisor to management and cross-functional teams, recommending best practices and innovative approaches. Influence technical decisions and contribute to the development of departmental or area strategy.
- Operational Excellence: Oversee incident response and root cause analysis for high-impact production issues, ensuring rapid resolution and long-term prevention. Develop and refine monitoring and observability frameworks to proactively identify and address reliability and performance issues across multiple services. Drive automation initiatives, creating sophisticated tools and processes to streamline operations and reduce manual intervention.
- Project & Team Leadership: Lead projects and initiatives that carry notable risk and complexity and often span multiple teams or departments. Mentor and guide junior engineers, fostering a culture of continuous improvement and technical excellence. Act as a resource for colleagues, sharing expertise and building consensus on difficult or sensitive topics.
- Continuous Improvement & Innovation: Proactively identify and solve unique problems that have a broad impact on the business. Develop novel solutions in tools or processes to improve organizational performance. Contribute to the development of new products, processes, or services by applying appropriate technology.
Competencies
- Expert-level knowledge of SRE concepts, operations, incident response, monitoring, and reliability.
- Demonstrated ability to solve complex technical problems and exercise judgment based on multiple sources of information.
- Recognized as an internal technical expert with broad knowledge across the field of specialization.
- Strong leadership skills; able to lead cross-functional projects and initiatives.
- Excellent communication and influence skills; able to explain complex ideas and persuade senior stakeholders.