The role involves ensuring the reliability and performance of Celonis' SaaS platform by improving monitoring, automating incident prevention, and driving a culture of Site Reliability Engineering (SRE) across the organization.
Key Responsibilities
Improve monitoring and metrics for all Celonis services
Define and implement missing Service Level Objectives (SLOs)
Develop processes and automations to prevent problem recurrence and document knowledge
Champion reliability and promote a culture of Site Reliability Engineering (SRE) within the organization
Own and enhance the incident management process and facilitate blameless lessons learned
Share knowledge across teams and collaborate to engineer reliable and resilient services
Requirements
Solid experience within the SRE domain and an excellent background in software engineering (typically 10 years)
Outstanding communication and collaboration skills
Programming knowledge covering Java and Spring Boot
Experience working with large-scale distributed systems
In-depth and hands-on knowledge of Kubernetes
Experience with major cloud providers (AWS and Azure)
Experience with monitoring and observability solutions, e.g., Datadog
Benefits & Perks
compensation/salary range not specified
hybrid working options
company equity (RSUs)
generous PTO
extensive parental leave
dedicated volunteer days
access to gym subsidies
counseling and well-being programs
clear career paths
internal mobility
dedicated learning program
mentorship opportunities
community and support through inclusion and belonging programs
Ready to Apply?
Join Celonis and make an impact on the reliability and performance of its SaaS platform