About the Role
As a Site Reliability Engineer on this team, you will be responsible for the infrastructure of our core backend system and other future services. You’ll be tasked with optimizing our current infrastructure, identifying areas for improvement, and building tools and systems to automate processes where we can. With you at the helm, our systems should become more performant, reliable, and available.
Responsibilities
- Manage the full lifecycle of services, from initial setup and release to day-to-day CI/CD. You should be passionate about building software that streamlines and automates service deployment and operation for engineers.
- Design and develop container-based runtime infrastructure, along with the tooling to manage it (e.g. compute, storage, observability, caching, and messaging).
- Design and develop tools and services to accelerate and optimize developer velocity.
- Maintain services by monitoring and measuring availability, latency, and overall system health, and iteratively drive improvement in these areas by introducing new technologies or developing new tools.
- Provide vision and guidance for the evolution of our service architecture, with a focus on scalability and reliability.
- Assist in early-stage planning for new services through architecture and systems design reviews.
- Evangelize and practice effective incident response management.
Qualifications
- Bachelor's degree or higher in Computer Science or a related technical field.
- 5+ years of experience building and running large-scale, massively distributed, fault-tolerant systems.
- Deep understanding of the architecture, design, and operation of cloud platforms such as Alibaba Cloud, Azure, GCP, or AWS, and proficiency with infrastructure automation technologies such as Terraform, AWS CloudFormation, or Ansible.
- Demonstrated hands-on experience in containerization and orchestration technologies (e.g. Docker and Kubernetes).
- Experience in algorithms, data structures, complexity analysis, and software design.
- Hands-on coding experience with one or more languages such as Java, Python, Go, Ruby, or similar; experience in web service development is a plus.
- Experience with production logging, monitoring, and alerting tools such as ELK, Prometheus, and Grafana.
- Understanding of Unix/Linux operating system internals and administration (e.g. filesystems, shell scripting, system calls) or networking (e.g. TCP/IP, routing, network topologies).
- Strong ability to analyze and debug complex software and infrastructure issues, and develop tools/systems for task automation.
- Excellent communication skills, including the ability to manage communication effectively across multiple engineering teams.