Who We Are
The Cortex organization provides managed machine learning platforms, tools, processes, and workflows to developers at Twitter. We win when our customers win: our platforms help Twitter's users stay informed and share and discuss what matters, in service of the public conversation. Twitter is increasingly becoming an AI-first company, and Cortex is at the nexus of that evolution.
Our Cortex SRE team uses state-of-the-art open-source and proprietary technologies. We operate at a scale that few other companies do. We embed deeply with development teams, with a focus on up-leveling services and increasing automation. We operate both on-premises and in multiple clouds, across both online serving and offline modeling services. Joining our team is an opportunity for an SRE to grow into the machine learning world over time and to take on a broad range of work, including contributing directly to the applications.
We care deeply about:
- Enabling Ethical AI.
- Engineering excellence: good design abstractions, API stability, scaling, setting standards for other engineers to follow, and solid documentation.
- Staying abreast of, and compatible with, a quickly shifting technology landscape of Machine Learning platform components and related open-source solutions.
- Creating the best Machine Learning Platform environment for Twitter: one that provides an exceptional developer experience for our engineering customers and delivers value to Twitter’s users. We offer Machine Learning as a managed service to the rest of Twitter Engineering.
- Encouraging creativity and innovative solutions.
Our current projects include:
- Creating a high-scale, Kubernetes-based Machine Learning model serving solution in a hybrid cloud environment
- Establishing Kubeflow on GCP as a managed offering at Twitter
- Enabling model training in the GCP environment
- Serving models to partner development teams using AWS services
- Establishing tooling and other production infrastructure that spans AWS, GCP, and on-prem environments
- Enabling and sustaining GCP infrastructure/platform components for broader use in the Cortex platform, e.g. AI Platform, Dataflow, Dataproc, etc.
- Improving operations of ML Platform services
- Hosted notebooks
- Centralized ML Metastore
- Centralized ML Dashboard
How you'll work:
- Our team focuses on serving ML as a ‘managed service’ to our Twitter engineering customers. This requires an understanding of, and focus on, large-scale online serving systems of all kinds (as opposed to offline systems like Hadoop).
- You will embed deeply with your Software Engineering (SWE) counterparts and take an active role as a co-owner of production services to ensure services are built, maintained, and operated in a reliable and scalable way.
- You will be part of the successful delivery of new features and services, as well as the day-to-day successful operation of existing services.
- Collaborate with your SWE partners to drive operational health improvements, root-cause analysis, postmortem discussions, and the associated remediations that improve reliability and let operations scale sub-linearly.
- Partner with both SWE and SRE teams to apply techniques that reduce business risk.
- Perform infrastructure & configuration management, deploys, capacity modeling & planning, and incident mitigation.
- Identify common patterns in the challenges of operating services in production, and partner with others to design and implement reusable solutions and other cross-functional work that drives down the complexity, difficulty, cost, and risk of operating the business.
- You’ll be a member of a service on-call team, in the same on-call group as your SWE partners.
Who you are
We are looking for SREs who are passionate about enabling AI, have a desire to grow and learn new technologies, and love working in collaborative teams committed to serving their customers. You don’t need to have mastered Machine Learning to join this team!
Your responsibilities include
- Informing and accelerating GCP and AWS infrastructure adoption practices (sustaining and improving user onboarding, IAM, image management, Twitter systems integrations, security, etc.)
- Traditional SRE/operational support scopes such as automation, monitoring, workflow management, GPU cluster management, OS/kernel upgrades, RPM/Python dependency management, bare-metal host management/Puppet manifests, CI/CD, etc.
- Partnering and supporting existing Cortex Platform teams with Operational guidance and expertise on various project initiatives
- Creating tools and automation for operational support and management of DS/ML use cases
- Supporting various users and developers with operational issues (e.g. “I’m having trouble scheduling GPU jobs with Persistent Volumes”)
- Capacity planning and autoscaling
- Maintaining version updates of TensorFlow, PyTorch, et al.
- Partnering with Twitter’s Platform and Data Platform organizations to improve, enhance, and influence direction and integration opportunities
- Partnering with teams to improve, enhance, and integrate with the company’s GCP/AWS adoption and management strategy