About the Team
Tens of millions of customers play our games each day, which makes for quite the infrastructure challenge. We receive billions of requests across API calls and RPCs, and we serve billions of queries in our databases as well. This means we process many terabytes of logs and other data. All the while our growth continues, so these numbers aren’t static and the challenges keep mounting. As we continue to grow, scale and reliability are ever more paramount to the success of the company and the support of our gamers.
The SRE team is one of the newer parts of our Infrastructure team. We have a core team in Sao Paulo, but as we look to grow and scale we want to build out the team in terms of both leadership and geography.
About the Role
Our Irish office is looking for someone who wants the autonomy to make game-changing improvements to our infrastructure, who can deal with really high scale, and who is dedicated to the reliability of our end-to-end infrastructure. You will roll up your sleeves and dive into the issues that affect reliability, be it at the systems, software, process, or operational level. You will be relentless in the search for how best to automate, freeing up time for our SREs and software engineers to focus on the next big challenges. You will be equally relentless in working out how best to write playbooks and create seamless processes for handling all sorts of challenges (cost and efficiency, migrations, integrations and rollouts, outages, etc.).
As this is a management position for a team that doesn't exist yet, you will have full autonomy to build it. First and foremost, you will source, attract, and staff the local team, and help onboard and integrate its members with the Sao Paulo SRE team and the broader engineering organisation. You will need to be an experienced manager who has built out a team and knows how to staff one effectively with limited resources and creative planning. Second, you will need to design and implement the model for SRE engagement, working with engineering leadership on assimilation and support. This will then open up to developing a set of shared processes and playbooks through which not only the SRE org but the whole engineering org thinks about and designs for scale, reliability, and performance.
More about you
- You have solid experience with Postgres, Redis, Cassandra and MongoDB databases on-prem or in the cloud
- You have a solid understanding of systems and application design, including the operational trade-offs of various designs
- You have advanced knowledge of scripting and programming languages
- You have an expert understanding of Linux systems, services, optimization, storage subsystems, and file systems
- You have solid experience with cluster management systems (Kubernetes, Mesos) and configuration management software such as Salt
- You know how network services (DNS, TLS/SSL, HTTP) and network fundamentals (DHCP, subnetting, routing, firewalls, IPv6, BGP) work
- You have strong experience designing and managing multi-tenant database solutions (PostgreSQL)
- You are confident in your knowledge of load balancers (Nginx, HAProxy)
- You have a proven ability to collaborate and effect change with multiple stakeholders and organizations
- You are comfortable making tradeoffs and knowing what to prioritize
- You can navigate constraints and like finding creative solutions in ambiguous environments
- You have excellent written and interpersonal communication skills, as well as strong documentation skills
What you'll do
- Build and maintain a good relationship and coordination with our SRE team in Brazil
- Understand our whole and highly distributed stack
- Work closely with engineering teams to design, build, and maintain systems, helping them with database use, schema design, and query tuning
- Manage and develop large, cutting-edge clusters, using and implementing innovative technologies
- Troubleshoot issues and look at our systems with an eye toward scalable and efficient architectures
- Create a shared understanding within SRE and engineering on how to approach reliability (from monitoring to troubleshooting to redesign) and performance optimization
- Serve in and design 24x7 on-call services for our major systems
- Represent SRE to engineering leadership and push for ever-increasing automation, coaching and influencing other managers
- Hire at least 6 other engineers
- Guide, mentor, and train team members, helping to enhance our infrastructure and grow our team
What you'll need
- You have at least 5 years of experience managing a team
- You have a minimum of 5 years of experience handling services in a large-scale environment
- Bachelor's or Master's degree in a technical field such as Computer Science or Information Technology Engineering, or equivalent work experience
- Fluent English is a requirement
- Brazilian Portuguese is a plus
- Interest in gaming is also a plus
We welcome people from all backgrounds who seek the opportunity to help build the best gaming company, where everyone thrives.