About The Role

The Site Reliability Engineer is responsible for keeping our production ML infrastructure running smoothly. In this role you will establish Service Level Objectives (SLOs) that define what reliability looks like in our production machine learning environments, and engineer observability solutions so that each SLO is measured by Service Level Indicators (SLIs).


  • Establish SLOs and SLIs for data and machine learning pipelines in collaboration with data and ML engineers.
  • Design and implement observability for production ML systems, making them easy to instrument and monitor, and making issues easy to alert on, troubleshoot and resolve.
  • Monitor model prediction performance.
  • Maximize the utilisation of CPU/GPU clusters across data science teams. 
  • Improve and advance DataOps and MLOps infrastructure and operational processes. 

Basic Qualifications

  • Deep understanding of SRE fundamentals.
  • Experience in designing, building and operating distributed systems at scale. 
  • Strong proficiency with Python, Scala and/or Go.
  • Experience with AWS or other cloud providers.
  • Experience with Kubernetes.
  • Experience with ML inference servers such as Triton, KFServing, or Seldon Core.
  • Experience with observability tools such as Prometheus, Thanos, Cortex or Sensu.
  • 4+ years of industry experience in applied ML.

Preferred Qualifications

  • Experience with infrastructure-as-code tools such as Terraform, CloudFormation, Ansible, Puppet or Chef.
  • Experience at a tier 1 product company, or related experience working within a product organization.

