PayU is a leading financial services provider in global growth markets. We are building our next-generation end-to-end PSP solution. Our solutions are based on the most advanced technology that empowers billions of people and millions of merchants to buy and sell online, extending the reach of financial services.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that PayU’s services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an eye on capacity and performance.

SREs are responsible for the big picture of how our systems relate to each other. We use a wide range of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, holding blameless postmortems and deliberately breaking things to proactively identify potential outages are our bread and butter.

SRE's culture of diversity, intellectual curiosity, problem-solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.

As a tech lead, you are responsible for building and leading an agile team of SREs. You are expected to help them and other engineers understand and implement SRE practices, plan and prioritize the team's tasks and enable team members’ growth. You will also collaborate with other SRE teams around the world to build our global SRE tools and culture.

Responsibilities

  • Plan and lead day-to-day and future tasks.
  • Train, educate and grow SREs.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Help to maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Teach and practice sustainable incident response and blameless postmortems.
  • Join an on-call rotation to respond to availability incidents in PayU systems and support service engineers with customer incidents.
  • Use your on-call shift to prevent incidents from ever happening.
  • Make monitoring and alerting trigger on symptoms rather than on outages.
  • Design, build and maintain core infrastructure pieces that allow us to scale to support hundreds of thousands of concurrent payments.
  • Educate engineers on how to approach and debug production issues across services and levels of the stack.
  • Break things on purpose to identify potential outages.

You may be a fit for this role if you

  • Think about systems - edge cases, failure modes, behaviors, specific implementations.
  • Get the itch to automate any manual process you see.
  • Know your way around Linux and the Unix Shell.
  • Know what configuration management systems like Ansible (the one we use) are for.
  • Have strong programming skills (Go/Node.js/Java/Python).
  • Have an urge to collaborate and communicate remotely and asynchronously.
  • Have an urge to document all the things so you don't need to learn the same thing twice.
  • Can't help but fix something broken when you see it.
  • Have an urge to deliver quickly and iterate fast.
  • Have experience with Docker, Kubernetes, Terraform, or similar technologies.
  • Want to take part in both software and system engineering tasks.

Projects you could work on

  • Coding infrastructure automation with Ansible and Terraform.
  • Improving our Prometheus monitoring or building new metrics.
  • Helping deploy and fix new versions of PayU platforms.
  • Building new ways to prevent production failures and testing PayU platforms for resilience and reliability by implementing chaos engineering practices.
