CircleCI is seeking a Senior Site Reliability Engineer to work closely with our Software Engineers to deliver and manage the high-performance, scalable infrastructure underlying our multi-tenant Cloud offering. You will automate and optimize infrastructure by building appropriate tooling, and you will help software engineers through the design phase to optimize their services for scale in our production environment.
You'll join a highly distributed team building features and services designed specifically for macOS and iOS developers. You'll write sustainable, resilient code as part of an engineering organization that values collaboration, trust, and learning. You’ll be part of a team at the heart of CircleCI’s business responsible for build environments used by thousands of development teams every day.
About this role:
- Design and deliver solutions to improve the availability, scalability, latency, and efficiency of CircleCI’s services
- Foster a culture of observability and monitoring, helping your team use operational data to improve the stability and performance of our systems
- Diagnose and resolve production issues in conjunction with software engineering teams
- Implement shared infrastructure used by all services and teams within the CircleCI platform
- Support and advise software engineering teams in the design of scalable services
- Build and maintain tools for deployment, monitoring, and debugging
- Execute disaster recovery drills
- Participate in rotating on-call duties, including incident management
About you:
- Proficiency in one or more of: Go, Java, Python, C or C++, Clojure
- Experience working with Docker, Kubernetes, Terraform, Helm, AWS, and modern distributed SaaS infrastructure
- Knowledge of virtualization technologies, such as VMware or KVM
- Understanding of standard networking protocols and components such as: TCP/IP, HTTP, DNS, ICMP, VLANs, the OSI Model, IP Subnetting, and Load Balancing
- Knowledge of operating systems (processes, threads, IPC, concurrency, locks, mutexes, semaphores, etc.)
- Understanding of good monitoring and alerting practices, using tools like Datadog and PagerDuty
- Knowledge of the internal workings of at least one of: PostgreSQL, MongoDB, Redis
- Focus on security in the delivery of all levels of a system
- Passion for modern software development and operation, including agile, CI/CD, and infrastructure-as-code
- Desire to learn and grow your career as a Site Reliability Engineer
- 2 or more years of experience
Work remotely with our globally distributed team!
We’re a distributed company with teammates across the world. For this role, we are hiring engineers to work remotely in the United States and through our affiliate, Continuous Labs, in the following Canadian provinces: Alberta, British Columbia, Manitoba, New Brunswick, Newfoundland and Labrador, Nova Scotia, Ontario, Prince Edward Island and Saskatchewan.
CircleCI is the world’s largest shared continuous integration and continuous delivery (CI/CD) platform, and the central hub where code moves from idea to delivery. As one of the most-used DevOps tools, processing more than 1 million builds a day, CircleCI has unique access to data on how engineering teams work and how their code runs. Companies like Spotify, Coinbase, Stitch Fix, and BuzzFeed use us to improve engineering team productivity, release better products, and get to market faster.
CircleCI is proud to be an Equal Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law.