Embedded within the TuSimple Service Infrastructure group, the Senior Platform Engineer II is a specialized and adept engineer who designs, builds, operates, and maintains infrastructure services, tools, and libraries using cutting-edge technology. They are oriented towards eliminating risks and building reliable, scalable, and performant systems.
Others depend on the Senior Platform Engineer II to help accelerate the development cycle of machine learning products. As such, the incumbent uses their deep and broad technical experience to standardize deployments, ensure the auditability of infrastructure, automate various deployment processes, write documentation, and facilitate training events.
From diving into technical details for critical decisions, to holding workshops to impart knowledge and drive adoption, the Senior Platform Engineer II has exceptional communication skills and understands how their work impacts others, including end users. They employ a global mindset and excel at collaborating with global teams to resolve obstacles. In addition, they build strong relationships within the Platform Engineering team and beyond, by demonstrating appreciation and regard for others’ ideas and work product, and by helping the team consistently deliver results that meet and/or exceed expectations.
What You'll Do:
- Participates in designing, building, and operating TuSimple’s foundational software services and platforms.
- Skillfully implements new features and evolves existing infrastructure.
- Imparts knowledge and drives adoption via user guides, Application Programming Interface (API) references, and workshops.
- Develops both cloud platform (AWS and/or alternatives) and on-premise solutions.
- Supports and maintains core Artificial Intelligence (AI) infrastructure at TuSimple.
- Designs and implements tooling and automation for clustering, scaling, monitoring, and alerting.
- Standardizes Kubernetes deployments and ensures that infrastructure is auditable.
- Assists with recruiting and training initiatives; helps select candidates with strong skills and great potential, mentors junior engineers, and grows the teams’ technical capabilities and capacity.
- Ensures infrastructure security compliance; implements security, permissions, and authentication.
- Leads containerization and deployment of microservices on Kubernetes.
- Establishes and monitors application access and connectivity.
- Autoscales Kubernetes workloads and monitors their performance using Prometheus and Grafana, or similar tools.
- Performs Site Reliability Engineering (SRE) activities such as availability and reliability monitoring and reporting.
- Sets up infrastructure as a service (IaaS) using Terraform.
- Establishes and contributes to the continuous integration/continuous delivery (CI/CD) processes.
- Establishes and operates code repositories with GitHub Enterprise.
- Builds and maintains strong relationships across the organization.
What You'll Bring:
- Advanced capability to design and implement reliable, scalable, and performant distributed systems and data pipelines.
- Proficiency in modern container technologies including Docker and Kubernetes.
- Proficiency in infrastructure as code (IaC) tools like Terraform, Vagrant, Chef, Puppet, or Amazon Web Services (AWS) CloudFormation.
- Proficiency in Go (Golang) and Python.
- Ability to resolve ambiguity and collect feature requirements and feedback from users.
- Working knowledge of, and experience with, large scale distributed software and systems.
- Ability to actively collaborate with global teams and resolve obstacles by evaluating all possible solutions and using informed judgement to select the best path forward for the project.
- Demonstrated experience with security and access management.
- Ability to clearly and succinctly communicate verbally and in writing, in both technical and non-technical English.
- Ability to identify, troubleshoot, and resolve issues quickly and effectively.
- Strong attention to detail and documentation skills.
- Experience designing or maintaining machine learning platforms is considered an asset.
- Knowledge of, or experience with, Agile or Scrum project management environments/methodologies is considered an asset.
- Experience with supporting and maintaining networking infrastructure is considered an asset.
- Experience with system administration on AWS is considered an asset.
- Experience with large-scale backend systems and infrastructure is considered an asset.
- Experience with high availability and fault-tolerant systems is considered an asset.
- Previous experience in any of the following areas is considered an asset: handling infrastructure-level outages, writing blameless postmortems, and GPU/CPU scheduling.
- Global mindset; excels at collaborating with global teams and resolving obstacles by evaluating all possible solutions and selecting the best path forward for the project.
- Oriented towards eliminating risks and building reliable, scalable and performant systems.
- Strong interpersonal skills, underpinned by genuine and transparent communication, as well as appreciation and regard for others’ ideas and work product.
- High sense of urgency; self-starter, highly responsive, and able to work and deliver in a fast-paced stream-alignment environment.
- Driven to learn. Committed to keeping current with best practices and emerging industry trends in a quickly evolving sector.
- Intellectually curious with a strong bias to action.
- Driven to understand and collaborate with multiple stakeholders; able to understand and interpret stakeholder needs and translate them into clear objectives.
- Strong analytical, judgment, persuasion, and consensus-building abilities, particularly where there are competing interests.
- Strong oral and written communication skills. Capable of listening and obtaining clarification, and of changing approach or method to best fit the situation. Able to effectively partner with cross-functional teams to coordinate activities and accomplish goals.
- Demonstrated experience building and improving processes and promoting quality.
- Ability to work independently with limited required direction and guidance.
- Confident in making technical recommendations to senior management.
- Strong organizational skills; ability to coordinate multiple tasks and support projects of varying complexity in parallel within tight deadlines.
- Proven ability to work independently in a matrix organization; tech start-up experience preferred.
- Ability to maintain resilience throughout aggressive deadlines, changing priorities, and evolving operations, as is common in progressive start-up environments.
Benefits:
- 100% employer-paid healthcare premiums for you and your family
- Work visa sponsorship available
- Relocation assistance available
- Breakfast, lunch, and dinner served every day
- Full kitchens on every floor with unlimited snacks, drinks, special treats, fruits, meals, and more
- Stock options / equity
- Gym membership reimbursement
- Monthly team building budget
- Learning/education budget
- Employer-paid life insurance
- Employer-paid long-term and short-term disability insurance
TuSimple is an Equal Opportunity Employer. This company does not discriminate in employment and personnel practices on the basis of race, sex, age, handicap, religion, national origin, or any other basis prohibited by applicable law. Hiring, transferring and promotion practices are performed without regard to the above-listed items.