The role involves leading reliability engineering efforts for a cloud-based microservices platform, ensuring system performance, resilience, and scalability through automation, monitoring, and incident management practices.
Key Responsibilities
Ensure the health, performance, and resilience of the platform using SRE principles
Lead reliability efforts for microservices on Kubernetes, including observability, automation, and incident prevention
Develop and enforce SLOs, SLAs, and error budgets to drive reliability
Own high-priority application incident escalations, perform technical analysis, and restore services within SLOs
Automate manual processes to improve availability, latency, and performance of production services
Collaborate with engineering teams to conduct post-incident reviews and implement systemic reliability improvements
Requirements
Bachelor's or Master's degree in Computer Science, Software Engineering, or a related technical field, or equivalent hands-on experience.
Minimum of 8 years of experience in software engineering or Site Reliability Engineering (SRE) roles.
Deep experience with cloud platforms such as AWS, GCP, or Azure.
Proficiency in Java and the Spring framework, plus Python or a similar scripting language, in a Linux environment.
Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles.
Demonstrated ability to lead projects and influence engineering culture.
Knowledge of SRE principles, including SLI and SLO design, error budgets, and toil reduction strategies.
Excellent written and verbal communication skills in English.
Experience developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures.
Experience with CI/CD pipelines and tools such as ArgoCD, GitHub Actions, or similar.
Experience with Infrastructure as Code (IaC) tools such as Terraform and Kustomize.
Experience owning high-priority application incident escalations, including deep technical analysis and service restoration within defined SLOs.
Ability to develop and enforce SLOs, SLAs, and error budgets to drive reliability-focused development.
Ability to engineer solutions to enhance the availability, latency, and performance of production services, automating manual processes to eliminate toil and scale operational efficiency.
Benefits & Perks
Total compensation package including base salary, bonus, commission, equity, health, dental, and life insurance, 401(k), and paid time off
Hybrid working options
Generous paid time off (PTO)
Company equity (RSUs)
Extensive parental leave
Dedicated volunteer days
Access to gym subsidies
Counseling and well-being programs
Clear career paths and internal mobility
Dedicated learning programs and mentorship opportunities
Community and support through inclusion and belonging programs
Ready to Apply?
Join Celonis and make an impact in renewable energy