The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems with a focus on system availability and performance. SRE is a new discipline formed by combining traditional software engineering with technical operations. The SRE team needs to dive deep into the Shopee development lines to ensure that systems remain highly scalable as they evolve rapidly. From the perspective of stability and performance, its scope includes business development design, basic platform components (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization. We use engineering and service-oriented approaches to streamline the inefficient and complicated practices of the traditional operation and maintenance model, and we are committed to building a sound monitoring system to improve the efficiency of incident handling.

Job Description:

  • Responsible for maintaining middleware and distributed file systems such as Redis, Ceph, Kafka, etc.
  • Responsible for tech architecture review, capacity planning, cost optimisation, tracking and troubleshooting, and building a component monitoring system to maintain overall stability and efficiency.
  • Responsible for maintaining and developing the middleware ops automation platform, and for improving the operation and maintenance management of middleware and distributed file systems.
  • Act as owner and first incident responder for middleware/DFS components.

Requirements:

  • Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields
  • Less than 1 year of experience welcomed
  • Familiar with middleware or distributed file systems such as Redis, Kafka, Spark, RabbitMQ, and ELK
  • Solid programming foundation; familiar with common Python/Golang backend development frameworks
  • More than 3 years of experience in related fields; familiar with large-scale operation and maintenance
  • Excellent communication, organizational, and teamwork skills; able to adapt to a diverse international working environment, with some English proficiency

Skills below are optional but preferable:

  • Experience developing automation operation platforms for Redis/Kafka/Ceph
  • Experience with HDFS/Ceph development
  • Experience with Service Mesh
