Vault Health is a leading virtual-first healthcare platform that specializes in delivering remote diagnostics and specialty care to consumers directly, through their employers, and through their local public health agencies. Vault also leverages its virtual platform to facilitate decentralized clinical trials for companies in the pharmaceutical and biotech industries. Vault is a leading provider of at-home FDA-approved COVID-19 testing in the U.S., and its solution has been deployed to numerous local and state governments, airlines, universities, professional athletic teams, companies, and organizations. Today, Vault employs more than 500 people across the country and expects to continue growing as we expand our products and services.
About the Opportunity
We're looking for a Senior Site Reliability Engineer (fully remote) to join our Platform Architecture function. Platform Architecture at Vault helps the entire engineering team iterate faster, ship high-quality products, and scale the platform. SREs are a major part of our DevOps culture, in which developers are enabled to build, deliver, and own their respective services.
Site Reliability Engineering is a highly skilled discipline at the intersection of software engineering and systems engineering for large networked systems. The term was coined by Google to address its own service availability and has since become the standard-bearer for software engineering teams operating internet-scale, mission-critical products that require 24x7 availability.
You'll embed yourself in the entire software development lifecycle (SDLC), including core developer tasks such as building architecture-related features, shipping code, creating CI/CD pipelines, and providing education and documentation to other engineers.
You'll also own most of the tools and projects for scaling, proactively monitoring, and improving the reliability and availability of our systems. These projects include APM and cloud Infrastructure as Code.
You will report to and work with the CTO. This is a 100% remote role.
- Instrument a world-class APM and observability platform that lets the entire engineering team detect deviations in performance
- Create high-quality alerts based on business-centric performance metrics, including uptime, error rate, performance baselines, and infrastructure load
- Partner with product engineering teams and other SREs to optimize performance and solve issues across the entire stack: hardware, software, application, and network.
- Improve automation and reduce operational toil. We practice Infrastructure as Code, so most automation will be written in code.
- Actively participate in architecture and design reviews and in operational readiness exercises for new and existing services.
- Detect abnormalities in performance and proactively address alerts and deviations to reduce risk to the platform before customers are impacted
- You will be part of an on-call rotation consisting of SREs and engineers, but you are not required to solve every infrastructure problem. Our entire engineering team practices a DevOps culture and owns their respective services
- Experience as an SRE or Software Engineer with a keen interest in the performance and scalability of large systems
- Experience with Python and shell scripting
- Experience with APM tools like New Relic and Datadog; an understanding of the difference between APM and infrastructure monitoring tools is preferred
- Experience with infrastructure as code (Terraform, AWS)
- Experience running services in a large-scale environment is a bonus but not required
- Understanding of the Linux operating system, networking, and databases
- Knowledge of TCP/IP, HTTP, and web application security
- Able to configure, or learn to fix, network systems including DNS, DHCP, and load balancer technologies.
- A degree in computer science is helpful but not required. We value skills and technical aptitude over degrees
- You are a software engineer who is skilled in solving problems with code
- You are familiar with a wide range of AWS services and with solution architecture on AWS
- You have a deep interest in understanding systems and application design
- You love to learn, especially about how software interacts with networked systems, including polyglot databases, data pipelines, messaging, and third-party services.
- Network computing at scale is a novel problem for most engineering teams.
- Since not many companies encounter the scale that Vault has already reached, we do not expect you to know everything about the entire tech stack.
- As an engineering organization, we cultivate a culture of learning and development so that everyone can grow
- You are a hard worker and are adaptable to a startup environment
- You are able to work independently as part of our 100% remote team
Vault Health is an equal opportunity employer. All applicants will receive consideration for employment without regard to race, color, religion, sex, gender identity, national origin, age, disability, or veteran status.