The SRE Team at DKatalis is a 24/7 operation in charge of maintaining the digital platform that serves Bank Jago’s systems and services. You'll experience a mix of activities, from optimizing Kubernetes and maintaining system uptime to debugging production issues and running runbooks to mitigate potential ones. One of the key objectives of the SRE team is to constantly improve and uphold the reliability of our digital platform and our software release practices, such as deployment processes and the automation of recurring tasks. You should have a strong software engineering background, and you will have the opportunity to collaborate with various software squads in building a reliable digital platform.
Technologies We Use:
- Cloudflare, Google Cloud - GKE, Kubernetes, Tyk API Gateway
- GitLab, Terraform
- NodeJS, Java, Redis, Mongo, Kafka
- Dynatrace, PagerDuty
Role and Responsibilities:
- Participate in SRE software engineering, writing code that continually reduces human intervention in operational tasks and automates processes.
- Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
- You have strong systems experience and good coding practices.
- You have an analytical approach to identifying problem components based on data points. Reliability of systems & applications is your core passion.
- The team will be responsible for analyzing systems based on data points to identify workloads that are critical to the business.
- Comfortable working cross-functionally to ensure the success of the system’s operation. You will collaborate closely with other engineering and product teams to ensure that expected system behavior is understood and that monitoring exists to detect anomalies.
- Lead in-depth technical and data analysis to gauge service trends and drive improvements.
- You are comfortable with on-call responsibility and are able to manage a crisis working with the broader team, communicating progress and challenges during the crisis.
- Participate in, and continuously improve, high-quality and timely root cause analysis of major incidents and blameless post-mortems, so that we take action to avoid similar problems in the future.
- Contribute to the prioritization of reliability features and to the design, development and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
- Contribute to proactive technical communication of reliability, stability and efficiency results (based on Service Level Objectives), service health (via dashboards), and key reliability risks and issues to senior business and technology stakeholders, in order to prioritize activity (based on trend analysis) and direct investment and action.
- You are either a Software Engineer with a real interest in, and ideally some experience with, Linux systems, networking, monitoring and automation; or an experienced sysadmin or systems engineer with professional skills in Linux, preferably on distributed systems at scale, and a demonstrable interest and experience in using software engineering to solve operational problems.
- Comfortable writing software to automate API-driven tasks at scale. Cloud Tooling engineers primarily use NodeJS and/or Java; Go is also a key language in our environment.
- Experience automating the build and deployment of software products, and an understanding of the related challenges in distributed systems.
- Excellent communication (both verbal and written). The ability to communicate confidently and clearly on conference calls, in meetings, via email, etc. at all levels of the organization is essential.
- Ability to quickly and clearly communicate incident status via email in business-friendly language.
- 8+ years of experience in software development and/or SRE functions, with at least 3 years in a senior/lead capacity.
- Degree in Computer Science, Engineering, or equivalent experience.
- Experience with, and an advanced understanding of, observability, CI/CD and release management.
- Broad knowledge of OS platforms (Linux/UNIX), networking, web systems and DevOps.
- Experience working with large-scale distributed systems and an understanding of microservices architecture concepts.
- Strong organizational skills and the ability to effectively manage multiple tasks simultaneously.
- Able to work in a complex, fast-paced environment and to remain calm in stressful situations.