Staff ML Infrastructure Software Engineer

Twitter

San Francisco, CA
Full Time
Paid
  • Responsibilities

    Job Description

    Who We Are

    Cortex empowers internal teams to efficiently leverage ML by providing a platform and by unifying, educating, and advancing the state of the art in ML technologies within Twitter. We win when our customers win by helping our users stay informed, share and discuss what matters; by serving the public conversation. We’re building an AI-first company and every major initiative is increasingly dependent on the successful application of machine learning. Cortex is at the nexus of this evolution.

    Our team of ML software engineers is constructing one of the strongest machine learning platforms in the world by marrying the latest ML industry practices with engineering excellence and the need to perform at Twitter scale. Our customers are all the ML engineers at Twitter, and our goal is to provide a unified tooling ecosystem that lets these engineers focus on what they are good at, building ML models with novel approaches, while abstracting away the complexities of bringing these models into a production environment.

    We care deeply about:

    • Engineering excellence such as good design abstractions, API stability, unit testing, leading best practices for other engineers to follow, and solid documentation.

    • Staying abreast of, and compatible with, a quickly shifting technology landscape for ML platform components and related open source solutions.

    • Creating the best ML Platform environment for Twitter that provides an exceptional developer experience for our engineering customers.

    • Encouraging engineering creativity and innovative solutions.

    Our current projects include:

    • Establishing Kubeflow as a managed offering at Twitter

    • Enabling and sustaining GCP Infra/Platform components for broader use in the Cortex platform, e.g. AI Platform, Dataflow, Dataproc, etc.

    • Improving Operations of essential ML Platform services

      • Hosted notebooks

      • Centralized ML Metastore

      • Centralized ML Dashboards

    If this sounds like a team you want to be part of, great! We are looking for engineers who are passionate about writing code, have a desire to learn new technologies, love working in collaborative teams, and are committed to serving their customers.

    Your responsibilities include:

    • Informing and accelerating GCP Infrastructure adoption best practices (sustaining and improving User Onboarding, IAM, Image Management, Twitter Systems Integrations, Security et al)

    • Absorbing existing SRE/Operational support scopes (GPU Cluster Management, OS/Kernel Upgrades, RPM/Python Dependency Management, Bare Metal Host Management/Puppet Manifests, etc)

    • Partnering with and supporting existing Cortex Platform teams with Operational guidance and expertise on various project initiatives

    • Creating tools and automation for Operational support and management for DS/ML use cases

    • Supporting various users and developers with operational issues (e.g. “I’m having trouble scheduling GPU jobs with Persistent Volumes”)

    • Capacity Planning

    • Maintaining the version updates of TensorFlow / PyTorch et al

    • Partnering with Twitter’s Platform and Data Platform orgs to improve, enhance and influence direction and integration opportunities

    • Partnering with teams to improve, enhance and integrate with the company’s GCP Adoption & Management strategy

  • Qualifications

    Who You Are

    • Minimum of 6 years of handling services in a large-scale distributed systems environment, preferably services on GCP (e.g. BigQuery).

    • Expert knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack.

    • Expert knowledge of systems programming (bash and shell tools) and practical, proven knowledge of at least one higher-level language (Python, Go or Scala).

    • Comfortable working with on-prem and cloud-based infrastructure (AWS, GCP) in terms of deployment, support, monitoring, administration and troubleshooting.

    • Experience using containerization software such as Kubernetes, Docker, and Mesos.

    • Track record of practical problem solving, excellent communication, and documentation skills.

    • Proven understanding of systems and application design, including the operational trade-offs of various designs.

    • Ability to lead and mentor technical teams through design and implementation across an organization.

    • Ability to work well with, and influence, a myriad of personalities at all levels.

    • Adaptability and the ability to focus on the simplest, most efficient and reliable solutions.

    • Solid understanding of algorithms, distributed systems design and the software development lifecycle.

    Additional Information

    All your information will be kept confidential according to EEO guidelines. We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, color, ethnicity, ancestry, national origin, religion, sex, gender, gender identity, gender expression, sexual orientation, age, disability, veteran status, genetic information, marital status or any legally protected status.

    We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

  • Industry
    Media Production