Manage, Deploy, and Monitor your models in one place.
The Verta Model Delivery & Operations Platform takes any data science or machine learning (ML) model and instantaneously packages and delivers it by using best-in-class DevOps support for CI/CD, operations, and monitoring. Automate increased availability, scaling and safe deploys to simplify your workflow and get more out of your AI-ML data.
As a Site Reliability Engineer, you will own the performance and reliability of Verta's infrastructure, the robustness of the deployment pipeline, and timely, effective incident response and resolution. You will also take responsibility for the growth and stability of Verta's infrastructure.
- 5+ years of experience in Site Reliability or Tools Engineering
- 3+ years of coding experience (Python, Java or Go)
- Experience with Kubernetes and operators
- Experience with AWS and Terraform
Exciting things you can work on:
- Provide observability and help ensure robustness of the Verta platform and the workloads running on it. We’re helping our customers solve ML problems by running their code and analysis on top of our system, which comes with a set of challenges that you don’t get when building a regular, non-platform product. When considering how things run, you have to account for the interaction between your internal product and the things users have built themselves!
- Increase developer productivity by implementing CI/CD best practices, automating frequently performed operations, and integrating tools to reduce the burden on our developers. We are strong believers in delivering products fast and safely, which requires establishing the automation that creates guardrails for developers.
- Automate compliance guidelines and enforce processes. Machine learning touches sensitive data, and our customers trust us to properly handle their workloads and data. The scope ranges from defining and implementing our security posture to adopting bleeding-edge security practices for the ecosystem we use (e.g. Kubernetes).
- Build automation and operators to facilitate running and developing Verta everywhere. As a platform, we have to run under a wide range of circumstances, for both customer and developer needs. Relying only on manual processes makes the system brittle, so we rely heavily on automation to ensure the system behaves the way we want it to and is fixed automatically when something goes wrong.
- Expand the locations and configurations under which Verta can operate. Machine learning presents a rare combination of privacy concerns, heavy computation, and large data transfers, which means that we need to run specific workloads in specific locations and configurations (e.g. European data needs to remain in Europe). The challenge is to build a platform that can run under all of these constraints simultaneously while providing a central view and control of everything that is going on.
Our customers are typically internal engineering and external IT.