LearnUpon is looking for a Staff Site Reliability Engineer to join our team in Ireland.
LearnUpon LMS helps organizations train their employees, partners, and customers. Businesses can manage, track, and achieve their unique learning goals — all through a single, powerful solution.
With offices in Dublin (our HQ), Philadelphia, Belgrade, and Sydney, we are a team that puts our customers' experience at the heart of everything we do. We're always striving for the best solution (not the easy one), and we go the extra mile to deliver work we're proud of.
Our culture fosters an open, collaborative environment where team and individual accomplishments are celebrated and encouraged. At LearnUpon, we work together as a friendly, supportive team who, most importantly, like to have fun.
About the Team:
The SRE Team sits within LearnUpon’s Engineering group. We are a small team focused on maintaining and expanding our cloud infrastructure and app services, to ensure platform scalability and site availability as we look to grow threefold over the next few years. We are key consultants for the entire company on matters of infrastructure management and observability.
About the Role:
As a Staff SRE, you will be a key technical leader and role model within both SRE and the overall Engineering organization. You will drive an SRE approach to the work we do and help define the future vision of SRE within LearnUpon. You will have a say in everything from build to the provisioning and maintenance of our cloud infrastructure, as well as the reliability, observability, monitoring and metrics of our platform.
You will be responsible for designing, implementing and maintaining a highly available and scalable infrastructure with the primary focus on planning ahead for our future as we look to transition towards a containerized environment.
While our tool stack is predominantly AWS, Terraform, Ansible and Packer, we welcome anyone with substantial experience with similar technologies to be part of our journey. We prefer choosing the right technology for the right problem so you’ll have plenty of space to grow your skills.
What will I be doing?
As a Staff Site Reliability Engineer you will be part of the team responsible for the scale-out of the LearnUpon infrastructure. The main responsibilities are:
- Identifying opportunities to improve and scale our infrastructure for performance, observability, maintainability, and cost, by creating innovative solutions with a strong emphasis on infrastructure as code.
- Leading our efforts to build an observability function that incorporates application metrics, application transaction tracking, and event log management.
- Leading and planning, from an SRE perspective, our transformative initiative of moving LearnUpon production infrastructure to a containerised environment, orchestrated by Kubernetes.
- Working with other Engineering teams to provide infrastructure solutions that meet their ongoing requirements while also looking at future capacity needs.
- Building tools focused on measuring, monitoring and alerting, with an eye towards self-service in order to promote Engineers’ ownership of observability.
- Reacting quickly to changing customer and business needs.
- Mentoring junior talent.
- Participating in the on-call rota.
What skills do I need?
- 7+ years of experience working with SaaS products at scale within an SRE/DevOps role.
- 5+ years of cloud engineering experience, with at least 2 years' experience with AWS.
- Strong experience implementing infrastructure as code (e.g. CloudFormation, Terraform), automation tooling (e.g. Puppet, Ansible), and CI/CD (e.g. Jenkins, Travis CI, GitLab).
- Experience in designing and implementing observability tech stacks using tools such as Grafana, Prometheus, Datadog, and New Relic.
- Experience deploying microservice environments, using containerisation technologies such as Docker and Kubernetes.
- Ability to design an SLO/SLI implementation that balances the needs of different teams.
- Experience building and supporting large-scale distributed systems that back a consumer app or website with associated requirements of performance, security and disaster recovery.
- Familiarity with cost analysis of observability metrics gathering, engineering effort, and tooling.
- Ability to communicate technical ideas effectively to, and collaborate with, both technical and non-technical peers.
- Experience with database scaling would be a strong plus.
Don’t worry if you don’t tick every box. We’re always happy to review applications and take all experience into consideration, and we do our best to provide feedback where we can!
Why work with us?
- Work in a fun and supportive environment with regular team events.
- Excellent career progression - take LearnUpon where you think it can go.
- Structured learning environment.
- Competitive salary and company ESOP.
- Employer Contributed Pension.
- Private health insurance.
- 25 days annual leave + 1 Company day off.
- Flexible Working Arrangements.
What is the Hiring Process?
Applicants for the position can expect the following hiring process:
- Qualified applicants will be invited to schedule a 30-minute call.
- Successful candidates will then be invited to a series of practical interviews.
- Finally, candidates will have a short interview with our CEO/CTO.
- Successful candidates will be contacted with an offer to join our team.
LearnUpon is an Equal Opportunities Employer. We do not discriminate on the basis of gender, marital status, family status, age, disability, sexual orientation, race, religion, membership of the Traveller community, or any other legally protected status.