A Staff Site Reliability Engineer at Redwood Materials is responsible for designing, implementing, and maintaining highly available and scalable systems, automating infrastructure management, and ensuring system resilience to support the company's rapid growth in battery recycling and energy storage solutions.
Key Responsibilities
Collect business and technical requirements to establish system objectives and SLOs
Design and implement highly available, scalable hybrid on-premises systems using platform technologies
Coordinate cross-functional teams to ensure system solutions meet business needs
Automate deployment and management of IT infrastructure to improve efficiency and recovery times
Develop integrations for data visibility and system utility
Support deployed systems by responding to incidents, troubleshooting, and participating in on-call rotations
Lead post-incident reviews and implement improvements to prevent recurrence
Requirements
Bachelor's degree in information technology or a related field.
At least 2 years of experience in a Site Reliability Engineering (SRE) related role.
At least 5 years of experience in an IT systems-related role.
Experience administering IT infrastructure such as VMware, Active Directory, Windows Server, Linux, networking, load balancing, and cloud infrastructure (including AWS and Azure).
Expertise in scripting, coding, automation, and integration using tools, interfaces, and data formats such as Python, Ansible, Chef, Puppet, REST, YAML, and JSON.
Ability to collect business and technical requirements and work with cross-functional teams to establish Service Level Objectives (SLOs).
Ability to design effective hybrid on-premises system solutions with high availability and scalability, utilizing platform technologies including vSphere, Kubernetes, Linux, and Windows.
Ability to coordinate work across IT, Software, Industrial Controls, Engineering, and Business teams to implement complete systems that meet business needs.
Ability to identify opportunities to automate the deployment and management of IT infrastructure systems to reduce manual effort and speed recovery.
Ability to develop integrations that streamline data visibility across components to deliver complete, efficient, and user-friendly systems.
Ability to support deployed systems by responding to incidents, leading fast triage, troubleshooting issues, and participating in an on-call rotation.
Experience with post-incident reviews and driving improvements to eliminate repeat failure modes.
Physical ability to perform essential job functions safely and successfully in environments that may include exposure to noise, dust, chemicals, and temperature extremes, while protected by PPE.
Physical ability to perform essential physical requirements such as climbing, standing, stooping, or typing.
Ability to maintain regular, punctual attendance consistent with ADA, FMLA, and other federal, state, and local standards.
Willingness to work in challenging conditions, including occasional work on weekends, nights, or being on-call, and occasional travel.