Embedded within the TuSimple Service Infrastructure group, the Technical Lead Manager (TLM), Platform Engineering leads a team of specialized and adept engineers who design, build, and operate/maintain infrastructure services, tools, and libraries using cutting-edge technology. The TLM, Platform Engineering works to deliver operational Artificial Intelligence (AI) platforms at scale and speed. The incumbent helps manage and maintain critical TuSimple infrastructure and is oriented towards automation, eliminating risks, and building reliable, scalable, and performant platforms/systems.
The TLM, Platform Engineering provides daily guidance, coaching and direction to the Simulation and Regen infrastructure team members, ensuring their development, empowerment, and motivation. Others depend on the TLM, Platform Engineering to help accelerate the development cycle of machine learning products. As such, the incumbent uses their deep and broad technical experience to standardize deployments, ensure the auditability of infrastructure, automate various deployment processes, write documentation, and facilitate training events.
They employ a global mindset and excel at collaborating with global teams to resolve obstacles. In addition, they build strong relationships within the team and beyond by demonstrating appreciation and regard for others’ ideas and work product, and by helping the team consistently deliver results that meet or exceed expectations.
What You'll Do:
- Leads the comprehensive process of designing, building, and operating TuSimple’s foundational software services for Regression Testing and Simulation platforms.
- Provides leadership in the recruitment, training, and development of top-quality engineering talent, ensuring high levels of performance and productivity. Builds morale, fosters productivity and teamwork, and promotes a positive and supportive work environment. Creates a culture of continuous improvement for processes, systems, data, training, and people.
- Provides highly scalable, reliable, and secure infrastructure to build distributed applications.
- Develops highly available and fault-tolerant systems to achieve 99% service level objectives (SLOs) for business continuity without downtime or packet loss, even during software upgrades.
- Skillfully implements new features and evolves existing infrastructure.
- Imparts knowledge and drives adoption via user guides, Application Programming Interface (API) references, and workshops.
- Designs the underlying infrastructure and technical architecture, including big data computing, orchestration and scheduling, and cloud enablement.
- Develops both cloud platforms (AWS and/or alternatives) and on-premise solutions.
- Improves the reliability of, and continuously develops, the platforms responsible for running Simulation and Regression Testing services at TuSimple.
- Develops deployment automation and configures a variety of high-performance computing (HPC) architectures and hardware configurations.
- Researches performance across different hardware configurations utilizing HPC clusters, GPU acceleration, low-latency, high-traffic networks, and other high-performance computing configurations on bare metal and cloud infrastructure.
- Designs and implements tooling and automation for clustering, scaling, monitoring, and alerting.
- Standardizes Kubernetes deployments and ensures that infrastructure is auditable.
- Ensures infrastructure security compliance; implements security, permissions, and authentication.
- Assists with recruiting and training initiatives; helps select candidates with strong skills and great potential, mentors junior engineers, and grows the teams’ technical capabilities and capacity.
What You'll Bring:
- Demonstrated ability to lead, inspire, and motivate an engineering team to effectively and efficiently accomplish goals and collaborate. Ability to create a sense of “team” across various locations.
- Advanced capability to design and implement reliable, scalable, and performant distributed systems and data pipelines.
- Familiarity with the whole web stack, including protocols and web server optimization techniques.
- Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms.
- Experience with at least one orchestration system (e.g., OpenStack, Kubernetes, YARN).
- Proficiency in infrastructure as code (IaC) tools like Terraform, Vagrant, Chef, Puppet, or Amazon Web Services (AWS) CloudFormation.
- Demonstrated background and experience in networking, including network software and protocols such as TCP/IP, iptables, and routing protocols.
- Demonstrated programming experience with proficiency in Go (Golang), Java, or Python.
- Ability to resolve ambiguity and collect feature requirements and feedback from users.
- Experience with automated deployment and integration tooling.
- Ability to actively collaborate with global teams and resolve obstacles by evaluating all possible solutions and using informed judgement to select the best path forward for the project.
- Experience designing or maintaining a machine learning platform is considered an asset.
- Knowledge of, or experience with, Agile or Scrum project management environments/methodologies is considered an asset.
- Experience with supporting and maintaining networking infrastructure is considered an asset.
- Experience with system administration on AWS is considered an asset.
- Experience with large-scale backend systems and infrastructure is considered an asset.
- Experience with high availability and fault-tolerant systems is considered an asset.
- Previous experience in any of the following areas is considered an asset: handling infrastructure-level outages, conducting blameless postmortems, and GPU/CPU scheduling.
Benefits:
- 100% employer-paid healthcare premiums for you and your family
- Work visa sponsorship available
- Relocation assistance available
- Breakfast, lunch, and dinner served every day
- Full kitchens on every floor with unlimited snacks, drinks, special treats, fruits, meals, and more
- Stock options / equity
- Gym membership reimbursement
- Monthly team building budget
- Learning/education budget
- Employer-paid life insurance
- Employer-paid long-term and short-term disability insurance
TuSimple is an Equal Opportunity Employer. This company does not discriminate in employment and personnel practices on the basis of race, sex, age, handicap, religion, national origin, or any other basis prohibited by applicable law. Hiring, transferring and promotion practices are performed without regard to the above-listed items.