Senior Site Reliability Engineer, ML SRE
Company Description
Our client is a global marketplace for unique and creative goods. They build, power, and evolve the tools and technologies that connect millions of entrepreneurs with millions of buyers around the world. As an employee, you will tackle unique, meaningful, and large-scale problems alongside passionate coworkers, all while making a rewarding impact and Keeping Commerce Human.
Job Description
The ML SRE team supports and provides developer-focused, scalable, and reliable infrastructure for developing and deploying ML services seamlessly.
What you’ll be working on
About the Role
- Design, build and support the core infrastructure used by all ML services, including on-call production support rotations.
- Work cross-functionally with various platform teams, ML teams and product partners to build the next generation of our high-availability ML services in the cloud.
- Build and maintain observability and test tooling: logging, monitoring, distributed tracing, alerting and offline test tools as needed.
- Practice continuous learning and an agile delivery model to stay informed and focused on our deliverables.
- Support and maintain GKE services, including software upgrades, performance tuning, and GKE cluster tuning and optimization.
- Build GKE tooling and automate deployments.
Qualifications:
About You
- You have solid engineering and coding skills, knowledge of data structures, and the ability to write high-performance, production-quality code.
- You have experience working with languages like Java, Scala, Python, Go or other equivalent languages.
- You are a strong collaborator and communicator, and you help the engineers around you grow and learn.
- You have fundamental experience with infrastructure engineering and strong troubleshooting skills.
- You have a solid background in and hands-on experience with cloud technologies, either Google Cloud or AWS.
- Experience with search technologies such as Lucene/Solr or Elasticsearch is a plus.
- Experience with supporting ML Services is a plus.
- Experience with Kubernetes and Docker is a plus.
- Experience with Unix/Linux operating systems and the networking stack (e.g., TCP/IP, routing, network topologies and hardware, SDN) is a plus.
- Experience with Grafana is a plus.