Seldon was founded in 2014 with a simple yet ambitious aim: accelerate the adoption of machine learning to solve some of the world’s most challenging problems. Since 2018, Seldon has raised €10 million in funding and achieved 38% MoM growth, working with partners such as Google, Red Hat and IBM, as well as leading enterprise clients.
We have created a culture that we’re proud of, driven by our passionate, talented team and a truly collaborative ethos. Our workplace is engineering-led and design-driven, offering an agile and fast-paced environment that is evolving as we scale, enabling unique opportunities to grow and develop your career.
We’re looking for a Site Reliability Engineer (SRE) to join our team. We are focused on making it easy for machine learning models to be deployed and managed at scale in production. We provide Cloud Native products that run on top of Kubernetes and are open-core with several successful open source projects including Seldon Core, Alibi:Explain and Alibi:Detect. We also contribute to open source projects under the Kubeflow umbrella including KFServing.
You will be joining Seldon to help us expand our product range with confidence across various cloud and on-prem installations. It’s a key initial hire for us in the area of SRE and you will be pivotal in establishing our approach and core best practices.
What you will be doing
- Extend open and closed products to allow for Cloud Marketplace publishing
- Build observability and monitoring for internal services to ensure reliability
- Manage our Marketplace releases, including the Google and Red Hat marketplaces
- Manage internal services on cloud infrastructure
- Establish and maintain CI/CD pipelines for core open and closed source projects
- Develop and improve the reliability process across the engineering team
- Manage team IAM across infrastructure used by the team
- Work collaboratively across our Technology team
What skills you will bring to the team
- A minimum of two years' experience in industry or academia, with completed projects to show for it
- A degree or higher level academic background in a scientific or engineering subject
- Experience working with Linux and the Unix Shell
- A proactive approach to continuous improvement and problem solving
Core skills required, developed either through existing experience or with a demonstrable desire to learn:
- Familiarity with Cloud Infrastructure (GCP, AWS)
- Implementing "Infrastructure as Code" (Terraform or similar technologies)
- Managing Kubernetes and Docker environments
- Knowledge of monitoring and alerting technologies
- Knowledge of CI/CD systems
- Strong programming skills (Golang, Python, Bash)
- Interest in using and contributing to Open Source tools
Advantageous but not necessary:
- Experience with maintaining / deploying machine learning models in production
What you will gain in return
- An exciting role in a fast growing company, with the opportunity to help shape our SRE approach
- A supportive and collaborative team environment with a commitment to learning and career development
- Share options to align you with the long-term success of the company
- 28 days annual leave
- Access to discounted lunches, gyms, shopping and cinema tickets
- Healthcare Cash Plan and Employee Assistance Programme
- Flexible approach to hybrid working
- Pension scheme
- Cycle to work scheme
Our interview process normally consists of a phone interview, a coding task, and a final interview of 2-3 hours (carried out virtually). We promise not to ask you any brain teasers or trick questions. We might design a system together on a whiteboard, the same way we often work together, but we won't make you write code on one. Our recruitment process takes 3 weeks on average.