XMotors.ai develops autonomous driving features for Xiaopeng electric vehicles in the Chinese market. We are looking for people who are passionate about self-driving vehicles to join us and help build top-class autonomous vehicles.
We are looking for a DevOps Engineer who will work closely with machine learning engineers and maintain our AI infrastructure.
Responsibilities:
Manage cloud and on-premises GPU clusters across multiple sites, including GPU server setup, upgrades, and maintenance.
Manage the container-based training environment, including Nvidia driver installation and upgrades and containerized training environment setup.
Provide daily support for GPU server access and training data management.
Set up and maintain CI/CD for AI infrastructure upgrades and automated deployment.
Requirements:
Bachelor's degree or above in Computer Engineering, Computer Science, Information Systems, or a related field, with 2+ years of relevant work experience.
Hands-on knowledge of Docker, Slurm, Kubernetes, cloud platforms, and related technologies.
Familiar with Nvidia GPU cards and with the installation and setup of related deep learning toolkits.
Familiar with scripting (e.g. shell, Python) and Linux system administration (e.g. Ubuntu, CentOS).
Familiar with automation tools such as Ansible.
Familiar with monitoring and dashboard tools such as Prometheus and Grafana.
Knowledge of NCCL and CUDA is preferred.
What we provide:
A fun, supportive and engaging environment.
Opportunities to pursue and work on cutting-edge technologies.
Competitive salary.
Snacks, lunches and fun activities.
We are an Equal Opportunity Employer. It is our policy to provide equal employment opportunities to all qualified persons without regard to race, age, color, sex, sexual orientation, religion, national origin, disability, veteran status, marital status, or any other proscribed category set forth in federal or state regulations.