Database Reliability Engineers (DBREs) are responsible for keeping the database systems that support all user-facing services (most notably GitLab.com) and many other GitLab production systems running smoothly 24/7/365. DBREs are a blend of database engineering and administration gearheads and software crafters who apply sound engineering principles, operational discipline, and mature automation, specializing in databases (PostgreSQL in particular). In that capacity, DBREs are peers to SREs and bring database expertise to the SRE and SAE Infrastructure teams as well as to our engineering teams.
GitLab.com is a unique site, and it brings unique challenges: it’s the biggest GitLab instance in existence; in fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our team feeds back into other engineering groups within the company, as well as to GitLab customers running self-managed installations.
As a DBRE you will:
- Work on database reliability and performance for GitLab.com from within the SRE team, and help ship solutions with the product.
- Analyze solutions and implement best practices for our main PostgreSQL database cluster and its components.
- Work on observability of relevant database metrics and make sure we reach our database objectives.
- Work with peer SREs to roll out changes to our production environment and help mitigate database-related production incidents.
- Provide on-call support on rotation with the team.
- Provide database expertise to engineering teams (for example through reviews of database migrations, queries and performance optimizations).
- Work on automation of database infrastructure and help engineering succeed by providing self-service tools.
- Use the GitLab product to run GitLab.com as a first resort and improve the product as much as possible.
- Plan the growth of GitLab's database infrastructure.
- Design, build and maintain core database infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.
- Support and debug database production issues across services and levels of the stack.
- Make monitoring and alerting fire on symptoms rather than on outages.
- Document every action so your learnings turn into repeatable actions and then into automation.
You may be a fit for this role if you:
- Have at least 5 years of experience running PostgreSQL in large production environments.
- Have at least 2 years of experience with infrastructure automation and configuration management (Chef, Ansible, Puppet, Terraform…).
- Have at least 3 years of experience with any object-oriented programming language in a software engineering role.
- Have experience with Ruby on Rails, Django, other Ruby and/or Python web frameworks, or Go.
- Have strong programming skills.
- Have a solid understanding of SQL and PL/pgSQL.
- Have a solid understanding of the internals of PostgreSQL.
- Have experience working in a distributed production environment.
- Share our values and work in accordance with them.
- Have excellent written and verbal English communication skills.
- Have an urge to collaborate and communicate asynchronously.
- Have an urge to document all the things so you don't need to learn the same thing twice.
- Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have an urge to deliver quickly and iterate fast.
- Know your way around Linux and the Unix shell.
- Have the ability to orchestrate and automate complex administrative tasks, and knowledge of configuration management systems such as Chef (the one we use).
- Have a passion for stable and secure systems management practices.
- Have strong data modeling and data structure design skills.
Projects you could work on:
- Review, analyze, and implement solutions for database administration (e.g., backups, performance tuning).
- Work with Terraform, Chef, and other tools to build mature automation (e.g., automatically setting up new replicas, or testing and monitoring backups).
- Implement self-service tools for our engineers using GitLab ChatOps.
- Provide technical assistance and support to other teams on database and database-related application design methodologies, system resources, and application tuning.
- Review database related changes from engineering teams (e.g., database migrations).
- Recommend query and schema changes to optimize the performance of database queries.
- Jump on a production incident to mitigate database-related issues on GitLab.com.
- Participate actively in infrastructure design and scalability discussions, focusing on data storage.
- Make sure we know how to take the next step to scale the database.
- Design and develop specifications for future database requirements including enhancements, upgrades, and capacity planning; evaluate alternatives; and make appropriate recommendations.