The Site Reliability Engineer is responsible for the performance, stability, and reliability of mission-critical cloud infrastructure and services. The role focuses on automation, monitoring, incident response, and collaboration with development teams to improve system scalability and availability.
Key Responsibilities
Ensure performance and stability of mission-critical infrastructure and production services across a global environment
Establish and maintain service reliability through monitoring, incident response, and root cause analysis
Design and implement automation and orchestration solutions to improve operational efficiency
Collaborate with development teams to integrate SRE principles and improve service architecture for high availability and scalability
Build and enhance observability tools for system health monitoring, metrics collection, and alerting systems
Drive adoption of modern cloud operations technologies, including Infrastructure as Code, container orchestration, and high-availability solutions
Requirements
Demonstrated ability to write production-quality code using languages such as Python, Go, Java, C, or C++, including experience with software design, implementation, and maintenance.
At least 3 years of experience as a Site Reliability Engineer (SRE) or DevOps engineer supporting globally distributed SaaS services.
Systematic and data-driven problem-solving approach, coupled with strong communication skills and a deep sense of ownership for critical production services.
A solid understanding of enterprise systems performance analysis and debugging, with the ability to use metrics and data to drive system improvements.
Ability to establish and maintain service reliability for core cloud platforms and infrastructure by implementing monitoring, incident response, root cause analysis (RCA), and resolution for production issues in a 24x7 environment.
Experience in transforming operational practices by designing and implementing automation and orchestration solutions for manual cloud service operations and deployment to enhance efficiency and reduce human error.
Experience partnering cross-functionally with development teams to integrate SRE principles early in the development lifecycle, including defining improvements to service architecture that support high availability, scalability, and adherence to SLAs.
Experience building and evolving observability stacks by setting up, configuring, and improving service health monitoring, collecting and reporting key metrics, and establishing alerting systems.
Experience driving adoption of modern cloud operations technologies, including exploring and integrating new tools for Infrastructure as Code (IaC), container orchestration, and high-availability (HA) architectures.
Ability and willingness to work from the Prague office in compliance with company policies, unless on PTO, work travel, or other approved leave.
Benefits & Perks
Compensation/salary range: Not specified in the posting
Work schedule: Flexible time off
Work environment perks: Wellness resources, company-sponsored team events
Additional benefits: Accommodations for disabilities, inclusive and diverse work culture, opportunities for growth and development
Ready to Apply?
Join Pure Storage and make an impact on mission-critical cloud infrastructure