The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems, guided by system availability and performance. SRE is a discipline that combines traditional software engineering with technical operations. The SRE team needs to work closely with Shopee's development lines to ensure that systems remain highly scalable as they evolve rapidly. From the perspective of stability and performance, its scope includes the design of business applications, basic platform components (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization. We use engineering and service-oriented approaches to streamline the inefficient and complicated processes of the traditional operations and maintenance model, and we are committed to building a sound monitoring system to improve the efficiency of incident handling.
- Design and develop systems and platforms to improve the stability, scalability, security and efficiency of Shopee.
- Keep improving the utilization of resources and the performance of systems in a quantitative way.
- Keep improving the maintainability of platforms and make them easier to use. Optimize the processes and workflows in the current systems according to developers’ feedback and business requirements to reduce the learning curve.
- Use automation and engineering solutions to reduce manual operations, detect potential issues, and achieve system self-healing in common cases.
- Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields;
- Passionate about coding and programming, innovation, and solving challenging problems;
- In-depth understanding of computer science fundamentals (data structures and algorithms, operating systems, networks, databases, etc.);
- Strong hands-on experience with at least one of the following programming languages: Go, Python, C++, Java;
- Strong analytical and problem-solving skills, with the ability to thrive in difficult and stressful situations;
- Fast learner and a good team player.
Skills below are optional but preferable:
- Experience with automation tools like Ansible, SaltStack
- Experience with monitoring tools like Prometheus, Zabbix, Grafana, etc.
- Experience with load balancing tools like LVS, Nginx, OpenResty or HAProxy
- Experience with container technology such as Docker, Kubernetes
- Experience with VM technology such as KVM, Xen, OpenStack
- Experience in design and development of large-scale distributed systems
- Experience in middleware development, deployment, and operations
- Contributions to open-source projects