Infrastructure Operations Reliability Engineer (Bangalore)


Who We Are:

Twitter’s Infrastructure Operations team is seeking qualified reliability engineering applicants to join our Command Center. Join the team responsible for leading incidents to resolution at Twitter, and build tools and automation to make sure those incidents don't happen again. Our job is to increase the reliability of the Twitter service by providing continuous site monitoring, oversight and management of key control processes, and effective communication around reliability-related events, and by writing the tools that automate this work.


What You’ll Do:

You will be responsible for effectively triaging and troubleshooting a complex environment operating at massive scale. As an incident manager, you will resolve critical system issues on a continuous basis, including notification, coordination, and dispatch of individuals from various functional groups. You will take individual ownership of issues and pursue resolution tenaciously. You will communicate and disseminate information effectively to other teams and executive management. You will develop tools to visualize log data using Python, JavaScript, or other languages. We are looking for someone with a variety of deep systems experience, superb communication skills, attention to both details and the big picture, and a real passion for Twitter.

Who You Are:

  • You have a firm understanding of TCP/IP networking, SMTP, SSH, DNS, CDNs, and network security.
  • You have scripting experience and are proficient in at least one of shell, Python, Ruby, or Perl.
  • You are familiar with Apache Pig, Hadoop, MySQL, Vertica, or related technologies.
  • You have knowledge of large data center environments.
  • You have experience operating a service in AWS.
  • You have strong interpersonal and communication skills.
  • You have high attention to detail.
  • You have the ability to work independently.
  • You are available to work a shift schedule.
  • You have excellent debugging and analytical reasoning skills.
  • You are experienced in building automation to simplify triage, resolution and analysis.
  • You value lightweight but high-quality process and documentation for complex systems.


  • 2 to 5+ years of incident management experience on a large-scale platform.
  • 5+ years of experience with distributed systems at scale in a Linux/Unix environment, as an administrator or developer.
  • B.S. in Computer Science or equivalent experience.

Hiring Process

Step 1

After you apply, a recruiter may reach out to you for an introductory call.

Step 2

If your background is a match for the role, you may have a phone interview with 1-2 people.

Step 3

If you continue through the process, you will come onsite 1-2 times to interview with a total of 5-10 people.



Twitter does not accept unsolicited resumes from recruiting agencies and will not pay fees associated with any such resumes. Agencies, please do not send resumes to any Twitter location, employee, or email address.
