We're on a mission to simplify the everyday lives of consumers. We believe post-purchase is a critical phase of the customer journey. That's why we created Narvar, a platform focused on driving customer loyalty through seamless post-purchase experiences that allow retailers to retain, engage, and delight customers. If you've ever bought something online, there's a good chance you've used our platform!
From the hottest new direct-to-consumer companies to retail’s most renowned brands, Narvar works with Patagonia, GameStop, Neiman Marcus, Sonos, Nike and 650+ other brands. With offices in San Francisco, London, Paris, and Bangalore, we've served over 125 million consumers worldwide across 7 billion interactions, 38 countries, and 55 languages.
Pioneering the post-purchase movement means navigating into the unknown. Our team thrives on this sense of adventure while nurturing a mindset of innovation. We're a home for big hearts and we leave our egos at the door. We work hard but we always make time to celebrate professional wins, baby showers, birthday parties, and everything in between.
We are looking for a Principal Site Reliability Engineer to lead cloud operations and data infrastructure for all Narvar products. You will lead the reliability, scalability, and availability of our overall infrastructure with an eye toward automation, optimizing for lower MTTR and operational cost.
In your first year, you will define and execute a roadmap across the engineering organization to make fully automated, self-service, highly scalable, cost-efficient, observable, auditable, and reliable infrastructure services standard practice. You will evaluate, propose, and drive large improvements to production system uptime, with a significant impact on our business bottom line and engineering team efficiency.
To accomplish all of this, you will partner with stakeholders ranging from executives to engineers across the organization.
- Provide expert technical guidance and ongoing engineering design review to teams planning and implementing large migrations, broad architectural shifts, and capacity growth
- Build a metrics-driven operational culture by standardizing SLO definition and review, logging, monitoring, alerting, and on-call practices
- Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies
- Partner closely with Security, Quality, and Product teams to deliver high-priority security, privacy, compliance, reliability, and business-continuity objectives as part of the overall product roadmap
What we’re looking for
- You have proven, hands-on technical leadership experience with demonstrated business impact
- You have software engineering and systems engineering skills
- You have deep technical experience with technologies including AWS/GCP, Linux, Docker, Jenkins, Kubernetes, Prometheus, ELK, Grafana, data stores (Cassandra, Yugabyte, Redis, MongoDB, etc.), messaging and search systems (Kafka, Pulsar, Elasticsearch, etc.), and service-oriented architecture
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.