The Site Reliability Engineering (SRE) team needs another manager: Someone with experience writing performant, distributed software, and managing projects of different sizes. A decision maker, equally comfortable with high-level architectures and teeny-tiny details. A mentor who can help their team improve technical and non-technical abilities.
They should also be familiar with Scala, Golang, Ruby, or Java—ideally more than one.
What you'll do
- Test and tune network, hardware, and software configurations to maximize performance.
- Create tools and infrastructure used by the rest of the Tumblr engineering teams.
- Manage the availability, scalability, and performance of Tumblr platforms.
- Set short- and long-term priorities and goals for your team.
- Coordinate cross-team projects.
- Mentor your peers and reports through individual instruction and code review.
- Help hire, onboard, and train new members of your team.
What we're looking for
- Experience managing a fast-moving, highly skilled infrastructure engineering team.
- A problem-solver who evaluates every possible solution.
- Ability to troubleshoot large-scale distributed systems.
- Previous experience scaling high-traffic websites and apps.
- Familiarity with Unix systems administration, and solid scripting skills.
- Willingness (nay, eagerness) to perform on-call duties, and previous experience doing so.
- Knowledge of data structures and algorithms.
- A sense of ownership, initiative, and drive.
- Persistence and resourcefulness when obstacles arise.
Tools we like
- Nginx, Varnish, and HAProxy
- Memcached and Redis
- MySQL (InnoDB)
- git and GitHub
- Ruby, Go, Scala, PHP
- Asynchronous services and queues like Oozie and Gearman
- Hadoop, Pig, ZooKeeper, and other Java/JVM projects
- Nagios, Icinga2, PagerDuty, OpenTSDB
- OpenStack, Docker, Mesos