The Site Reliability Engineer (SRE) is responsible for enhancing monitoring, automation, and incident management processes to ensure the reliability and resilience of Celonis's SaaS platform, using software engineering and cloud technologies.
Key Responsibilities
Improve monitoring and metrics for all Celonis services, and define and implement missing SLOs
Implement processes and automations to prevent problem recurrence and document knowledge
Champion reliability and promote SRE culture within the organization
Own and enhance the incident management process and facilitate blameless lessons learned
Share knowledge across teams and collaborate to engineer reliable and resilient services
Requirements
Solid experience in the SRE domain and an excellent background in software engineering (typically 10 years)
Outstanding communication and collaboration skills
Programming knowledge covering Java and Spring Boot
Experience working with large-scale distributed systems
In-depth and hands-on knowledge of Kubernetes
Experience with major cloud providers such as AWS and Azure
Experience with monitoring and observability solutions, e.g., Datadog
Benefits & Perks
Generous PTO
Hybrid working options
Company equity (RSUs)
Comprehensive benefits
Extensive parental leave
Dedicated volunteer days
Access to resources such as gym subsidies, counseling, and well-being programs
Clear career paths
Internal mobility
Dedicated learning program
Mentorship opportunities
Ready to Apply?
Join Celonis and make an impact on the reliability and resilience of its SaaS platform