As a Site Reliability Engineer (SRE) on Twitter’s Storage Infrastructure team, you will work to improve the reliability and performance of the next generation of distributed systems and containerized deployments. This team ensures the availability of in-memory data services (including Redis and Memcached) and caches content from foundation storage platforms. You will partner with product engineering teams to design, build, operate, and automate the distributed storage services at the heart of Twitter’s infrastructure, which is used by millions of people.
We are looking for software engineers who are passionate about reliability, performance, and efficiency, and who have experience building tools, services, and automation to manage and improve production services.
This team has some exciting challenges ahead. Services need to adopt IPv6, transition to Kubernetes, and reimagine elasticity. Team members have opportunities to influence how Twitter uses its future caching infrastructure. You will work directly with most Twitter engineering teams to improve how they interact with caching services.
- Build tooling to improve operations automation, including automatic failure detection and remediation, application deployment, OS/kernel deployment, capacity planning, and fleet management.
- Diagnose and troubleshoot complex distributed systems handling millions of queries per second and petabytes of data, and develop solutions that have a significant impact at our massive scale.
- Collaborate with software engineers to sustain and optimize service availability, reliability, and performance.
- Collaborate with the diverse hardware, software, and networking teams throughout the company to design next-generation distributed storage platforms.
- Troubleshoot issues across the entire stack: hardware, software, application, and network.
- Produce results for large-scale projects and lead active collaboration across multiple teams.
- Scope work for multiple engineers, often across multiple teams.
- Maintain compliance with data privacy and service security requirements.
- Participate in a 24x7 on-call rotation.