SREs with an Environment Automation specialization primarily focus on provisioning the various GitLab environments and automating every operational aspect of the application lifecycle. They have a strong operational background, but their strength is in converting regular manual actions into repeatable automated tasks.

  1. Automating every operational task is a core requirement for this role, for example package updates, configuration changes across all environments, and creating tools for automatic provisioning of user-facing services.
  2. Responding to platform emergencies, alerts, and escalations from Customer Support.
  3. Ensure systems exist to manage software lifecycles (e.g. operating systems) with a minimum of manual effort.
  4. Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on the usage patterns.
  5. Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimise their resource consumption.
  • Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents.
  • Use your on-call shift to prevent incidents from ever happening.
  • Run our infrastructure with Chef, Terraform and Kubernetes.
  • Make monitoring and alerting alert on symptoms and not on outages.
  • Document every action so your findings turn into repeatable actions, and then into automation.
  • Use the GitLab product to run GitLab.com as a first resort and improve the product as much as possible.
  • Improve the deployment process to make it as boring as possible.
  • Design, build and maintain core infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.
  • Debug production issues across services and levels of the stack.
  • Plan the growth of GitLab's infrastructure.
  • Think about systems - edge cases, failure modes, behaviors, specific implementations.
  • Know your way around Linux and the Unix Shell.
  • Know what configuration management systems like Chef (the one we use) are for.
  • Have strong programming skills - Ruby and/or Go
  • Have an urge to collaborate and communicate asynchronously.
  • Have an urge to document all the things so you don't need to learn the same thing twice.
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
  • Have an urge for delivering quickly and iterating fast.
  • Share our values, and work in accordance with those values.
  • Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies
  • Ability to use GitLab
  • Coding infrastructure automation with Chef and Terraform
  • Improving our Prometheus Monitoring or building new Metrics
  • Helping release managers deploy and fix new versions of GitLab-EE.
  • Plan, prepare for, and execute the migration of GitLab.com from virtual machines running on Google Cloud to cloud-native container-based deployments with Kubernetes using Google Kubernetes Engine.
  • Develop a relationship with a product group, define their SLAs, share GitLab.com data on those SLAs and improve their reliability

Leveling of Site Reliability Engineering at GitLab

  • Use Chef and Ansible to efficiently manage our infrastructure
  • Implement "Infrastructure as Code" using Terraform and GitLab CI/CD for automation
  • Load balancing the application including Proxies and CDN
  • Kubernetes and containerizing our system
  • Administer a high-availability PostgreSQL cluster.
  • Monitoring and Metrics in Prometheus, Grafana and integrations with Slack/PagerDuty
  • Logging infrastructure
  • Backend storage management and scaling
  • Disaster Recovery and High Availability strategy
  • Contributing to code in GitLab
  • Team organization and planning
  • Issue, Epic, OKR leadership and completion
  • Creating blog posts
  • Completing Root Cause Analysis (RCA) investigations
  • Contributions to handbook, runbooks, general documentation
  • Leading and contributing to designs for issues, epics, OKRs
  • Improving team practices in handoffs of work and incidents
  • Involvement in the hiring process - reviewing questionnaires, taking part in interviews, qualifying candidates
  • Knowledge sharing, mentoring
  • Accountability, Self awareness, handling conflict in the team and receiving feedback
  • Maintaining good relationships with other engineering teams in GitLab that help improve the product

The Junior Site Reliability Engineer is a grade 5.

Technical:

  1. Updates GitLab default values so there is no need for configuration by customers.
  2. General knowledge of 2 of the areas of technical expertise

Execution:

  1. Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
  2. Delivers production solutions that scale, identifies automation points, and proposes ideas on how to improve efficiency.
  3. Improves monitoring and alerting, fighting alert spam.

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.

Influence and Maturity:

  1. Shares learnings publicly, either by creating issues that provide enough context for anyone to understand them or by writing blog posts.

The Site Reliability Engineer is a grade 6.

Technical:

  1. General knowledge of 4 of the areas of technical expertise, with deep knowledge in 1 area

Execution:

  1. Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
  2. Proposes ideas and solutions within the infrastructure team to reduce the workload by automation.
  3. Plans, designs and executes solutions within the infrastructure team to reach specific goals agreed within the team.
  4. Plans and executes configuration change operations at both the application and the infrastructure level.
  5. Actively looks for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.

Influence and Maturity:

  1. Shares learnings publicly, either by creating issues that provide enough context for anyone to understand them or by writing blog posts.
  2. Contributes to the hiring process by reviewing questionnaires or by being part of the interview team that qualifies SRE candidates.

The Senior Site Reliability Engineer is a grade 7.

Senior Site Reliability Engineers are experienced Site Reliability Engineers who meet the following criteria:

Technical:

  1. Deep knowledge in 2 areas of expertise and general knowledge of all areas of expertise. Capable of mentoring Junior SREs in all areas and other SREs in their areas of deep knowledge.
  2. Contributes small improvements to the GitLab codebase to resolve issues

Execution:

  1. Identifies significant projects that result in substantial cost savings or revenue
  2. Identifies changes for the product architecture from the reliability, performance and availability perspective with a data-driven approach.
  3. Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.
  4. Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of these incidents.
  5. Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:

  1. Knows a domain really well and radiates that knowledge through recorded demos, discussions in DNA meetings, or Incident Reviews.
  2. Performs and runs blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

Influence and Maturity:

  1. Sets an example for the team of SREs with positive and inclusive leadership and discussion on work.
  2. Shows ownership of a major part of the infrastructure.
  3. Is trusted to de-escalate conflicts inside the team.

Country Hiring Guidelines

Please visit our Country Hiring Guidelines page to see where we can hire.


Your Privacy

For information about our privacy practices in the recruitment process, please visit our Recruitment Privacy Policy page.
