We are looking for an experienced Site Reliability Engineer to lead our efforts to build the next generation of monitoring and observability infrastructure at OfferUp. We provide tools and services to all teams in OfferUp for managing an increasingly complex production infrastructure handling billions of requests per day. Our success is measured by our ability to allow everyone to stand up and deploy services quickly with no downtime. In this role, you will be at the forefront of driving and developing the technology that improves the availability, scalability, performance and reliability of OfferUp.
- Work with other SREs to build a comprehensive set of tools to monitor our production infrastructure to detect issues before users do.
- Enhance the observability of our systems to reduce the time it takes to determine why an issue happened.
- Work with other engineering teams to build resilient, operable, self-healing services
- Participate in reasonable on-call rotations with the rest of Engineering
- Practice sustainable incident response and blameless postmortems
- Mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues
- Previous experience architecting, building and deploying monitoring and observability systems, preferably with statsd/Datadog, Prometheus, and SumoLogic.
- Solid understanding of systems and application design, including the operational trade-offs of various designs.
- At least 5 years managing servers at scale, preferably in AWS
- Ability to lead technical teams through design and implementation across an organization
- Reasonably deep knowledge of Linux and internet technologies
- Practical knowledge of service design, including messaging protocols and their behavior, caching strategies and software design practices.
Nice to have
- Experience with distributed tracing and the Cloud Native Computing Foundation technology stack.
- Previous experience driving adoption of new systems across engineering teams
- Contribution to open source projects
- An active interest in serverless computing and containerization
- Collaborates and works as a team
- Avoids doing things twice
- Solves hard problems for tomorrow, not just for today
- Stays positive and prefers fixing problems to complaining about them
- Investigates, considers and adopts new technology where it makes sense
- Doesn’t tolerate brilliant jerks