This role owns the reliability of a global customer fleet. You will develop software and AI solutions that analyze, diagnose, and predict system health, keeping uptime high and resolving complex technical issues in a distributed systems environment.
Key Responsibilities
Lead investigations and develop engineering solutions for emergent issues in the customer fleet.
Acquire a comprehensive understanding of Pure Storage's full software stack.
Develop services and AI tools for diagnosing and protecting the fleet.
Design analytics to monitor system health and predict potential problems.
Represent customers and collaborate with engineering teams to resolve complex product issues.
Participate in on-call rotations to respond to customer outages and improve system uptime.
Requirements
3-6 years of relevant work experience in Site Reliability Engineering (SRE) or similar roles with large production environments.
Ability to analyze complex systems and describe them in simple terms.
Proficiency in programming languages including Go, Java, Python, and C, and experience working with Linux.
Experience leading investigations and engineering solutions for emergent issues.
Experience acquiring a broad and deep understanding of software stacks.
Experience developing services and AI tools to diagnose and protect fleet systems.
Experience designing next-generation analytics to monitor system health and anticipate problems.
Ability to represent customers and work with engineering teams to present complex product issues and develop solutions.
Ability to participate in an on-call rotation to rapidly respond to customer outages worldwide and improve system uptime.
Benefits & Perks
Flexible time off
Wellness resources
Company-sponsored team events
Ready to Apply?
Join Pure Storage and make an impact on the reliability of a global customer fleet