The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems, with system availability and performance as the guiding goals. SRE is a new discipline that combines traditional software engineering with technical operations. The SRE team needs to dive deep into Shopee's development lines to ensure that the system remains highly scalable as it evolves rapidly. From the perspective of stability and performance, its scope includes the design of business applications, the components of the basic platform (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization. We use engineering and service-oriented approaches to streamline the inefficient, complicated procedures of the traditional operation and maintenance model, and we are committed to building a sound monitoring system that improves the efficiency of incident handling.

Job Description:

  • Responsible for maintaining big data systems such as Hadoop/Spark/Storm/Kafka.
  • Responsible for big data ops architecture review, capacity planning, cost optimisation, issue tracking and troubleshooting, and for building a big data monitoring system to maintain overall stability and efficiency.
  • Participate deeply in big data related businesses, such as search engines and deep learning, and promote the sustainable development of the big data business.
  • Responsible for maintaining and developing the big data ops automation platform, and for improving the operation and maintenance management of big data.

Requirements:

  • Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields.
  • Familiar with big data platforms and components such as Hadoop/ZooKeeper/Redis/Kafka/Spark/MQ/ELK.
  • Solid programming foundation; familiar with common Python/Golang backend development frameworks.
  • More than 2 years of experience in related fields; familiarity with large-scale big data operation and maintenance architecture solutions is preferred.
  • Excellent communication, organizational, and teamwork skills; able to adapt to a diverse international working environment, with working proficiency in English.

The skills below are optional but preferred:

  • Experience developing big data ops automation platforms is preferred.
  • Experience with Hadoop/Ceph development is preferred.
