Twitter developed and continually improves a large-scale storage platform. SREs ensure availability of the environment, with a watchful eye on security, capacity, and performance. This group writes software to improve service reliability and handle platform growth. Our tools and services reduce operational overhead and improve performance.
As a Site Reliability Engineer (SRE) on Twitter’s Storage Infrastructure team, you will work to improve the reliability and performance of the next generation of distributed systems and containerized deployments. This team ensures the availability of in-memory data services (including Redis and Memcached) that cache content from foundational storage platforms. You will partner with product engineering teams to design, build, operate, and automate the distributed storage services at the heart of Twitter’s infrastructure, which is used by millions of people.
We are looking for software engineers who are passionate about reliability, performance, and efficiency, and who have experience building tools, services, and automation to manage and improve production services.
- Build tooling to improve operations automation, including automatic failure detection and remediation, application deployment, OS/kernel deployment, capacity planning, and fleet management.
- Diagnose and troubleshoot complex distributed systems handling millions of queries per second and petabytes of data, and develop solutions that have a significant impact at our scale.
- Collaborate with software engineers to sustain and optimize service availability, reliability, and performance.
- Collaborate with diverse hardware, software, and networking teams throughout the company to craft next-generation distributed storage platforms.
- Troubleshoot issues across the entire stack: hardware, software, application, and network.
- Maintain compliance with data privacy and service security requirements.
- Participate in a 24x7 on-call rotation.