Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. Affirm, Inc. proudly includes Affirm, PayBright, and Returnly.
Affirm’s Infrastructure Platform team is building a large-scale, massively distributed, fault-tolerant global infrastructure shared across multiple financial products, merchants and vendors. Ensuring that our infrastructure is openly available to engineers is a critical part of Affirm’s success story. We pride ourselves on our culture of engineering design and architecture: writing detailed tech specs and capturing feedback before making large changes to systems.
We are looking for a Staff Site Reliability Engineer to join our Site Reliability Engineering team: someone with deep technical knowledge who is passionate about Linux, networking, microservices and distributed architectures, and who has experience running large-scale services. Our goal is to make Affirm's global, service-oriented product and infrastructure stack observable, highly resilient, scalable and fault tolerant while meeting our high uptime SLAs. You will excel if you have a passion for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Platform/Infrastructure and Product Development teams to build our next-generation “always up” cloud-based platform.
Our work spans Observability/Telemetry Engineering, Reliability and Scalability Engineering, Chaos Engineering, Performance Engineering, Capacity Engineering and Disaster Recovery Engineering, and includes working closely with the security team on managing application-level security.
Site Reliability Engineers are hybrid System, Software, Data and Network Engineers who are responsible and accountable for building and scaling reliable systems that impress our customers.
What you'll do
- Own the end-to-end availability, reliability and performance of mission-critical services
- Troubleshoot reliability, resiliency, scalability and availability issues
- Define and measure SLIs, SLOs and SLAs
- Augment instrumentation to build a cohesive dependency map, with special attention to points of failure
- Build command-and-control automation to fail away quickly, reducing TTR and eliminating manual work (toil)
- Assist with on-call and triage rotations
What we look for
- Linux, Networking and AWS experience
- Experience with containerization and container platforms (e.g., Docker, Kubernetes)
- Familiarity with Elasticsearch, Kibana/Grafana, Logstash, Kafka and ways to scale these systems
- Experience with automation systems (Ansible, Puppet, Terraform) is a plus; SaltStack preferred
- Experience with open source systems is a plus
- Software development experience in Python/Kotlin/Go is a plus
- Experience with high-performance networking (QUIC, network layer optimization) or real-time transaction protocols/methods (HTTP/2, Server-Sent Events, MQTT, WebSockets)
- Ability to recommend or help architect an entire system; expertise in capturing and interpreting TCP dumps with tcpdump, snoop, and other network sniffers; understanding and applied knowledge of most protocols (TCP/IP, HTTP, UDP, etc.)
Location - Remote U.S.
Affirm is proud to be a remote-first company! The majority of our roles are remote and can be located anywhere in the U.S. and Canada (with the exception of the U.S. Territories, Quebec, Yukon, Nunavut, and the Northwest Territories) unless the job indicates a different global location. We are currently building operations in Spain, Poland, and Australia. Employees in remote roles have the option of working remotely or from an Affirm office in their country of hire, and may occasionally travel to an Affirm office or elsewhere for required meetings or team-building events. Our offices in Chicago, New York, Pittsburgh, Salt Lake City, San Francisco and Toronto will remain operational and accessible for anyone to use on a voluntary basis, subject to local COVID-19 guidelines.
At Affirm, People Come First is one of our core values, and that’s why diversity and inclusion are vital to our priorities as an equal opportunity employer. You can read about our D&I program here and our progress thus far in our 2020 DEI Report.
We also believe It’s On Us to provide an inclusive interview experience for all, including people with disabilities. We are happy to provide reasonable accommodations to candidates in need of individualized support during the hiring process.