Job Description
Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. As a Manager of Chaos Engineering, you and your team will work to improve the resiliency of services across the company by reducing critical incident response time (CIRT) and increasing successful incident resolution. You will do this by defining policies and procedures, providing education and consultation, and methodically injecting faults into the system. You enjoy creating solutions to operations problems. You have a holistic knowledge of our systems and services, can re-engineer processes when they need it, and can clearly communicate the necessary changes. You understand how various development teams operate and can reduce their effort to deliver new services. You have the grace to stay calm when production services are down and the courage to ask for help from the right people as needed to bring them back up. You enjoy collaborating with people from other teams and disciplines to make plans a reality.
Responsibilities
- Manage a team of Site Reliability Engineers to increase the resiliency of all systems at Magic Leap.
- Work with other technical leaders to ensure that their systems meet Magic Leap standards for reliability.
- Follow the scientific method when evaluating and testing systems for faults.
- Provide product management when evaluating and implementing new systems and services
- Develop, evaluate, and report on the team’s success against Key Performance Indicators (KPIs)
- Create a positive culture, hire effectively, and keep the team happy and healthy
- Operate independently and lead other engineers through complex projects and tasks.
- Support and develop colleagues by providing advice and coaching
- Participate in rotating on-call duties on a 24x7x365 team
- Help facilitate and act as a key stakeholder in company retrospectives to ensure the safe and swift remediation of any incidents
- Develop solutions to increase service stability through automation and process re-engineering
- Build and support tools and systems that the enterprise will use to deploy its software into production
- Update job knowledge by studying state-of-the-art tools and techniques; participating in educational opportunities; reading professional publications; maintaining personal networks; and participating in professional organizations
Qualifications
- Experience working with incident response systems such as PagerDuty, Statuspage, OpsGenie, etc.
- Sound fundamentals in UNIX-based systems, including proficiency with UNIX tools like SSH, grep, sed, awk, find, etc.
- A solid understanding of networking and core Internet protocols (e.g., TCP/IP, DNS, TLS, SMTP, HTTP)
- Strong programming skills in a modern language (Go, Java, Node.js, Ruby, etc.)
- Ability to script in a shell language (Bash or POSIX Shell)
- Experience with public cloud providers (AWS, Google Cloud Platform, etc.)
- Experience working with containers (Docker, Kubernetes, ECS, etc.)
- Comfort with frequent, incremental code testing and deployment
- Strong grasp of automation tools (Terraform, Jenkins, Concourse CI, Bitbucket Pipelines, etc.)
- Comfort with collaboration, open communication and reaching across functional borders
- Ability to remain calm under pressure and take command of a recovery effort.
- Minimum of ten years of experience working in a software engineering, operations, or development role
Education
- BA/BS in Computer Science or equivalent experience
- MBA or equivalent experience preferred
Additional Information
- All your information will be kept confidential according to Equal Employment Opportunity guidelines.