Principal Kafka Site Reliability Engineer (DevOps)

Palo Alto Networks

Santa Clara, CA

Paid

Responsibilities
We are reshaping the cybersecurity market through our cloud-delivered security services, and our cloud infrastructure is growing rapidly, with a global footprint. We’re looking for great SREs, as well as software engineers interested in production engineering, to help us scale the largest enterprise security cloud infrastructure in the world.

Description

Palo Alto Networks reinvented the enterprise firewall, growing from a start-up to a multi-billion-dollar company. Our Application Framework, the latest offering in our cloud-delivered security services, ingests security events from hundreds of thousands of firewalls deployed across the globe to provide a massive data analytics platform for deep inspection, anomaly detection, and actionable security automation. Our cloud infrastructure is home to a series of massive and complicated distributed systems and virtualization software platforms which enable big data processing around security services, sandboxing and malware detection, URL categorization and malicious site/domain identification, and security research/response.

RESPONSIBILITIES:
- You will be responsible for maintaining and scaling production Kafka clusters with very high ingestion rates, ZooKeeper clusters, and other big data pipeline systems such as HDFS.
- You will improve scalability, service reliability, capacity, and performance.
- You will write automation code for managing, monitoring, measuring, expanding, and healing clusters.
- You are not an operator; you are an experienced software engineer focused on operations.
- You will do Kafka tuning, capacity planning, and deep-dive troubleshooting.
- You will occasionally participate in the on-call rotation supporting the infrastructure.
- You will roll up your sleeves to troubleshoot incidents, formulate theories and test your hypotheses, and narrow down possibilities to find the root cause.
QUALIFICATIONS:
- Hands-on experience managing production Kafka clusters.
- Strong development/automation skills. Must be very comfortable with reading and writing Python. Commits to Kafka source code would be a big plus.
- In-depth understanding of the internals of Kafka cluster management, ZooKeeper, partitioning, topic replication and mirroring.
- Very good grasp of monitoring and metrics collection, performance tuning, and troubleshooting complicated situations with distributed systems.
- Tools-first mindset. You build tools for yourself and others to increase efficiency and to make hard or repetitive tasks easy and quick.
- Organized and focused on building, improving, resolving, and delivering. A good communicator within and across teams, a strong team player, and someone who takes ownership.