Clearwater Analytics is looking for a technical expert to build out and refine the DevOps discipline within an enterprise SaaS environment, someone focused on continuous integration, continuous deployment, and promoting the productivity of all of Engineering. As a Site Reliability Engineer, you will build, evolve, and operate the infrastructure automation platform that powers our services. You will ensure that our production environment operates and performs optimally and efficiently, and that software is released and deployed in a streamlined manner, from development all the way to production. This is a hands-on operational role with a balanced amount of tool and infrastructure development, including advanced scripting and automation. You will support our systems, security, tools, and process infrastructure, whether on premises or in an external cloud, and support the entire stack for our service offering.
- Design, develop and deliver software to improve the availability, scalability, latency, and efficiency of Clearwater Analytics’ services.
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Work with engineering to deploy and operate cloud services and related projects from development to production.
- Maintain and configure internal applications and a number of data security and data management processes.
- Influence and create new designs, architectures, standards and methods for large-scale distributed systems.
- Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
- Perform periodic on-call duties.
- Assertiveness & creative ideas are mandatory.
- Excellent interpersonal skills suitable for user support, including the ability to lead projects with peer-level engineers/managers.
- Exceptional communication skills – both written and oral (one-on-one and group).
- Strong analytical, problem-solving, and decision-making skills.
- Must have a self-starting personality, unafraid to display initiative and innovation on the job.
- Experience in one or more of: Java, or scripting in Shell and Python.
- Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols
- Understanding of automation tools: Jenkins, GitLab, Artifactory, Jira.
- Networking: experience with network theory (e.g., TCP/IP, UDP, ICMP), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Knowledge of cloud-based services, preferably AWS, including but not limited to EC2, S3, CloudFront, EBS, and SQS.
- Knowledge of configuration management tools to help you manage software and system changes repeatably and predictably (Puppet, Chef, Docker, Salt, Automated Testing).
- Experience analyzing and building metrics-gathering systems (New Relic, Datadog, AppDynamics, etc.).
- MS in Computer Science or a related field OR BS (4 yr) in Computer Science or a related field with 3+ years of experience directly related to the duties and responsibilities specified.
- Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
- In-depth knowledge of operating systems (processes, threads, concurrency issues, locks, mutexes, semaphores, monitors and how they work).
- Familiarity with algorithms, data structures and complexity analysis.
- Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
- Ability to manage multiple projects with competing priorities.
- A track record of maintaining and improving skills in existing and emerging open source technologies through training or self-research.
- A love of learning and enjoyment of troubleshooting and thinking through complicated problems.
- Ability to manage time effectively in a fast-paced, customer-focused, changing environment.
- Comfortable with collaboration, open communication and reaching across functional borders.
- Willingness and ability to maintain a positive, quality-oriented, reliable, and flexible attitude.
- Willingness and ability to do what it takes to achieve objectives, including off-hours support or tasks.