Staff Site Reliability Engineer (SRE) - Infrastructure SRE
Who We Are:
Site Reliability Engineering at Twitter is responsible for the performance, reliability, and scalability of Twitter services in production. We build software to automate, optimize, manage, and maintain those services, driving down technical debt, operational cost, and toil every step of the way. We are the last line of defense for the Twitter platform, the chosen few tasked with keeping the tweets flowing.
As Site Reliability Engineers on the team supporting Druid as a service at Twitter, our mission is to build powerful solutions that make data accessible to a broad set of technical and non-technical customers for slice-and-dice analytics on both historical and real-time metrics. We work at an enormous scale -- terabytes of data are collected every day and we make it searchable in seconds. Advertisers, data scientists, and engineers need that data broken down by market segments and user attributes for real-time insights.
What you will do:
- You will build and scale our 1000+ node Druid interactive query infrastructure and help to define, architect, and build the next-generation engagement data processing architecture for advertising campaigns.
- You will join a passionate engineering team building Druid as a multi-tenant platform service, available to any Twitter engineering team, to help them make better decisions for our customers.
- You will work closely with product managers, data analysts, data scientists, and other engineers to build and maintain robust data products. You will build and use the latest highly scalable and performant systems to process dozens of terabytes of data a day.
- Your efforts will reveal invaluable business and user insights, leveraging vast amounts of Twitter revenue data to fuel numerous Revenue teams including Ads Analytics, Ads Experience, Ads Data Science, Marketplace, Targeting, Prediction, and many others.
- You will troubleshoot issues across the entire stack: hardware, software, application, and network.
- You will identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
- You will take part in a 24x7 on-call rotation.
- You will represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
Who You Are:
- You have solid experience building and operating Druid clusters on-prem or in the cloud.
- You have a solid understanding of systems and application design, including the operational trade-offs of various designs.
- You have practical, solid knowledge of shell scripting and at least one higher-level language (Python or Java).
- You have an expert understanding of Linux systems, services, optimization, storage subsystems, and file systems.
- You have 5 or more years of experience handling services in a large-scale environment; if you have less, we will still consider your application.
- You work well with, and are able to influence, a myriad of personalities at all levels.
- You are able to prioritize tasks and work independently.
- You are adaptable and able to focus on the simplest, most efficient & reliable solutions.
- You have a track record of successful practical problem solving, along with excellent written communication, social communication, and documentation skills.
- B.S. in computer science or similar experience.
- Ability to lead technical teams through design and implementation across an organization.
- Experience designing fault-tolerant distributed systems
- Experience with Hadoop or other MapReduce-based architecture
- Experience with real-time streaming (Apache Kafka, Apache Beam, Heron, Spark Streaming)
- Experience with coordination (Apache ZooKeeper)
- Experience with compute (Apache Mesos, GCE, GKE)
- Proficiency with SQL (Relational, Hive, Presto, MySQL)
Engineering Hiring Process
Once your application is received, a recruiter will reach out if your qualifications are a match for the role.
If your background is a match, you may have 1-2 technical phone interviews or be given the chance to provide a work sample depending on the role.
If the phone interviews go well or your work sample is strong, the final step includes interviews with 5-6 people held onsite in our office.