The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems, with system availability and performance as the guiding criteria. SRE is a new discipline formed by combining traditional software engineering with technical operations. The SRE team needs to dive deep into the Shopee development lines to ensure that the system remains highly scalable as it evolves rapidly. From the perspective of stability and performance, the work covers the design of business systems, components of the basic platform (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization. We use engineering and service-oriented approaches to streamline the inefficient and complicated procedures of the traditional operation and maintenance model, and we are committed to building a sound monitoring system to improve the efficiency of incident handling.
- Responsible for the architecture design and feature development of Shopee's basic monitoring, including monitoring data processing, incident root cause analysis, and alarm standardization, in order to provide accurate, real-time, full-view monitoring of online services such as servers, networks, databases, and containers.
- Responsible for establishing monitoring and alarm standards: monitoring ingestion standards, emergency alarm response standards, routine health inspections, etc.
- Analyze the data in the monitoring system and report the findings to help prevent problems and failures.
- Continuously evolve and optimize the monitoring platform.
- Bachelor's or higher degree in Computer Science or related fields.
- Familiar with Golang/Python, with more than 2 years of development experience; relevant project experience is preferred.
- Familiar with the principles and usage of Prometheus, Open-Falcon, Zabbix, or similar monitoring systems; experience customizing or extending them (secondary development) is preferred.
- An understanding of high-concurrency and high-availability system design, with experience in distributed system development.
- Strong enthusiasm for technology; a willingness to study new technologies and an innovative spirit are preferred.