Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that PayU’s services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users' needs and a fast rate of improvement while keeping an eye on capacity and performance.
SREs are responsible for the big picture of how our systems relate to each other. We use a wide range of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, running blameless postmortems, and deliberately breaking things to proactively identify potential outages are our bread and butter.
SRE's culture of diversity, intellectual curiosity, problem-solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
- Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Help to maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Teach and practice sustainable incident response and blameless postmortems.
- Join an on-call rotation to respond to PayU system availability incidents and support service engineers with customer incidents.
- Use your on-call shift to prevent incidents from ever happening.
- Make sure monitoring and alerting trigger on symptoms and not on outages.
- Design, build and maintain core infrastructure pieces that allow us to scale to support hundreds of thousands of concurrent payments.
- Educate engineers on how to approach and debug production issues across services and levels of the stack.
- Break things on purpose to identify potential outages.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviours, specific implementations.
- Get the itch to automate any manual process you see.
- Know your way around Linux and the Unix Shell.
- Know what configuration management systems like Ansible (the one we use) are for.
- Have strong programming skills - Go/Node.js/Java/Python
- Have an urge to collaborate and communicate remotely and asynchronously.
- Have an urge to document all the things so you don't need to learn the same thing twice.
- Can't help but fix something broken when you see it.
- Have an urge for delivering quickly and iterating fast.
- Have experience with Docker, Kubernetes, Terraform, or similar technologies.
- Want to take part in both software and system engineering tasks.
Projects you could work on:
- Coding infrastructure automation with Ansible and Terraform
- Improving our Prometheus monitoring or building new metrics
- Helping deploy and fix new versions of PayU platforms.
- Building new ways to prevent production failures and testing PayU platforms for resilience and reliability by implementing chaos engineering practices.
ZOOZ, a PayU company, provides a payments platform designed to help merchants improve and optimize their payment activities. We help merchants reduce costs, increase conversions, fight fraud and expand globally.
We are an enterprise payments platform that allows easy connectivity to multiple providers globally while leveraging data to optimize transactions.
The Engineering group at ZOOZ brings together software design, infrastructure management and design, operations, and engineering.