Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other Handy production systems running smoothly. SREs are a hybrid of operators and software engineers who apply engineering principles, operational experience, and automation to our environments. You will help shape our infrastructure and build the foundation our team relies on for the rapid, reliable delivery of our product. We’ll rely on you to instill best practices for building scalable distributed systems, with a keen focus on observability and fault tolerance. Our stack consists of technologies such as Kubernetes, Ruby on Rails, MySQL, Redis, and Elasticsearch, all running inside AWS.
We are looking for experienced Site Reliability Engineers who meet the following criteria:
- Breadth of knowledge across our infrastructure and application stack.
- Contributes small improvements across all codebases to resolve issues.
- Experience with container orchestration technologies like Kubernetes, Mesos, or Nomad. (We use Kubernetes.)
- A track record of leveraging automation whenever and wherever possible.
- An appreciation of and enthusiasm for software engineering best practices, such as infrastructure as code, testing, and continuous delivery.
- Identifies changes to the product or infrastructure architecture that improve reliability, performance, and availability, using a data-driven approach.
- Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, so that Handy operates with cost as a discipline.
- Identifies parts of the system that do not scale and provides both immediate and long-term resolutions for the resulting incidents.
- Identifies Service Level Indicators (SLIs) that align the team on meeting availability and latency objectives.
Collaboration and Communication:
- Know a domain really well and spread that knowledge across the rest of the engineering organization.
- Run blameless root cause analyses (RCAs) on incidents and outages, and drive the follow-up work that prevents them from recurring.
- Show ownership of a major part of the infrastructure.
As an SRE you will:
- Be part of an on-call rotation to respond to incidents and provide support for software engineers across Handy initiative teams.
- Build visibility into SLIs, SLOs, SLAs, and dependency graphs to reduce operational burden and toil.
- Drive instrumentation patterns that alert on symptoms rather than on outages, using our monitoring stack of Grafana, Prometheus, and Elasticsearch.
- Use your on-call shift to prevent incidents from occurring.
- Run our infrastructure with CloudFormation and Kubernetes.
- Take a data-driven approach to findings, turning them into repeatable actions and then into automation.
- Improve the deployment process to make it as quick and dependable as possible.
- Design, build and maintain core infrastructure pieces that allow Handy to scale to meet its market demand.
- Debug production issues across the full stack.
- Plan and shape the growth of Handy’s ever-evolving infrastructure.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Have an understanding of large scale system design, monitoring, and operational practices.
- Have strong programming skills in Ruby and/or Go.
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have a burning desire for delivering quickly and iterating fast.
- Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies.
Projects you could work on:
- Improve our monitoring stack across the board.
- Migrate our ingress controllers to a more cloud-native paradigm (Istio, Envoy, Traefik).
- Instrument our Rails app to collect important information about our applications.
- Automate an immutable Kubernetes upgrade pattern.
- Build tooling to help reduce toil across the engineering organization.
Benefits:
- Competitive salary and equity commensurate with experience and performance
- Full medical, dental, and vision package to fit your needs
- Monthly Handy credits
- Unlimited vacation policy; work hard and take time when you need it
- A fun office in the heart of the Flatiron district, always stocked with coffee, snacks and drinks; catered lunch and dinner, foosball, office events and team outings
- Ground-floor opportunity with the team
- The rare opportunity to work with sharp, motivated teammates solving some of the most unique challenges and changing the world