Away is seeking a Director of Site Reliability Engineering to join our team. The ideal candidate is interested in building scalable infrastructure, adding system resiliency, improving developer productivity, and automating everything that can (and should) be automated, and is also a thoughtful people manager and leader. They will oversee a small team of site reliability engineers responsible for overall site health and for reducing operational issues, as well as for the long-term strategy for our infrastructure.
This position is based at our headquarters in SoHo, New York City, and reports to the VP of Engineering.
What you’ll do:
Set the strategy and long-term roadmap for our site reliability function and technology infrastructure
Build out an SRE team from scratch
Work on projects to improve scalability of our systems as we support more countries, fulfillment centers, shipping carriers, technology platforms, etc.
Own site uptime, monitoring/alerting, CI/CD, cloud networking, security, and overall performance
Establish and monitor KPIs for reliability, throughput, quality, and controls; deliver dashboards that provide operational and executive views
Be a technology leader and a contributor to projects, including some coding, code reviews, and architectural discussions
Partner with Application Engineering to maximize platform reliability through improvements to code, tools, and monitoring
7+ years of site reliability engineering, DevOps, or related infrastructure experience
3+ years of engineering management experience
Experience with cloud infrastructure (AWS and Azure)
BS degree in Computer Science, similar technical field of study, or equivalent practical experience
2+ years of retail and/or e-commerce experience
Proficiency in Ruby on Rails or Node.js
Experience with data streaming platforms like Apache Kafka