This leadership role oversees reliability, scalability, and operational excellence for Everpure Cloud's platform. It involves managing SRE and platform teams, developing core cloud infrastructure, and ensuring the performance and resilience of cloud services.
Key Responsibilities
Lead and develop SRE and Platform teams, setting strategy and execution for reliability, scalability, and operability of Everpure Cloud.
Own reliability engineering by defining and evolving SLIs, SLOs, error budgets, incident response, change management, and runbooks.
Build and operate internal platform tooling for modern developer workflows, including CI/CD, observability, telemetry, and automation.
Operate and harden core cloud infrastructure such as Kubernetes and IaC across control and data planes.
Lead capacity planning, cost optimization, disaster recovery, and multi-region readiness.
Requirements
Proven leadership experience running SRE, Production Engineering, and Platform functions for SaaS or cloud services at scale, including building high-performing, inclusive teams.
Hands-on software development experience with fluency in engineering fundamentals, including design reviews, automated testing, CI/CD, and version control, with the ability to contribute to production-grade code.
Deep knowledge of SRE foundations such as SLIs, SLOs, error budgets, incident management, capacity planning, change and release management, and reliability reviews.
Practical cloud expertise, with a preference for Azure, including experience with modern SRE toolchain components such as containers, Kubernetes, Infrastructure as Code (Terraform, Bicep, CloudFormation), CI/CD pipelines, and observability tools like OpenTelemetry, Prometheus, Grafana, ELK, and Azure Monitor.
Experience operating and hardening core cloud infrastructure including Kubernetes and Infrastructure as Code (IaC) across control and data planes.
Experience leading capacity planning, cost optimization, disaster recovery, and multi-region readiness for cloud services.
Strong systems thinking and architectural skills, including resilience reviews, failure mode analysis, chaos engineering, disaster recovery testing, and data-driven stakeholder communication.
Ability to define and evolve SLIs, SLOs, error budgets, and operational excellence practices such as on-call management, incident response, change management, and runbook development.
Experience championing incident management, continuous improvement, blameless postmortems, and systematic toil reduction to improve Mean Time to Recovery (MTTR).
Work authorization and willingness to work onsite at the Prague office in accordance with company policies, unless on PTO, work travel, or approved leave.
Benefits & Perks
Competitive relocation package
Flexible time off
Wellness resources
Company-sponsored team events