PDT is building a Software Reliability Engineering organization to harden our production systems and automate our technical operations. We recognize that highly reliable and resilient systems are crucial to the firm’s continued success. We are searching for SREs who will work with dev teams collaboratively and creatively to help us identify, measure, and meet our service-level goals.
We’re hiring an SRE who will embed within the Trading Infrastructure team and help drive investments in monitoring, deployment automation, test frameworks, and incident response. They will identify and test new failure scenarios that drive the design and development of more resilient systems. They will develop best practices for planning and managing changes to our real-time trading systems.
Why join us? PDT Partners has a stellar twenty-six-year track record and a reputation for excellence. Our goal is to be the best quantitative investment manager in the world—measured by the quality of our products, not their size. PDT’s extremely high employee-retention rate speaks for itself. Our people are intellectually exceptional, and our community is close-knit, down-to-earth, and diverse.
Responsibilities
Engage with application developers throughout the full development life-cycle – from inception and design to deployment, operation, and iterative development. Drive conversations around system resiliency and observability.
Help identify, triage, and automate systems maintenance toil. Evolve systems by pushing for changes that improve reliability and developer velocity.
Drive adoption of new technology platforms among application teams at PDT, including Kubernetes and Prometheus.
Identify and drive investments in load, integration, and chaos testing.
Help run our production trading systems day-to-day. Software Reliability Engineers at PDT are not first-level responders, but we expect them to be involved in incident response so that they’re exposed to the maintenance costs of the system and can help reduce them over time.
Help develop robust organizational practices around monitoring, alerting, testing, deployment, and incident response.
Help identify key uptime and performance metrics for our production systems. Define and track SLOs for each.
Qualifications
4+ years of experience in a Software Engineering, DevOps, or Site Reliability Engineering role.
2+ years of experience working with a public cloud offering (preferably AWS).
Mastery of at least one compiled programming language. Experience with C++ is a plus.
Mastery of at least one scripting language. Experience with Python is a plus.
Mastery of at least one production configuration management tool and one cloud-based infrastructure-as-code tool. Experience with Kubernetes is a plus.
Past experience working in the “embedded SRE” organizational model.
Creative and collaborative mindset.
Excellent written and verbal communication skills.
Education
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field from a rigorous academic program.