A Staff Site Reliability Engineer at Redwood Materials is responsible for designing, implementing, and maintaining highly available and scalable systems, automating infrastructure management, and ensuring system resilience to support the company's rapid growth and sustainability goals.
Key Responsibilities
Collect business and technical requirements to establish system objectives and SLOs
Design and implement highly available, scalable hybrid on-premises systems using platform technologies like vSphere, Kubernetes, Linux, and Windows
Coordinate with cross-functional teams to ensure systems meet business needs
Automate deployment and management of IT infrastructure to improve efficiency and recovery speed
Develop integrations to enhance data visibility and system utility
Support deployed systems by responding to incidents, troubleshooting issues, and participating in on-call rotations
Lead post-incident reviews and implement improvements to prevent recurrence
Requirements
Bachelor's degree in information technology or any related field.
At least 2 years of experience in a Site Reliability Engineering (SRE) or related role.
At least 5 years of experience in an IT systems-related role.
Experience administering IT infrastructure such as VMware, Active Directory, Windows Server, Linux, networking, load balancing, and cloud infrastructure (including AWS and Azure).
Expertise in scripting, coding, automation, and integration using tools and formats such as Python, Ansible, Chef, Puppet, REST, YAML, and JSON.
Ability to collect business and technical requirements and work with cross-functional teams to establish Service Level Objectives (SLOs).
Ability to design effective hybrid on-premises systems with high availability and scalability, utilizing platform technologies including vSphere, Kubernetes, Linux, and Windows.
Experience supporting deployed systems by responding to incidents, leading fast triage, troubleshooting issues, and participating in an on-call rotation.
Ability to lead post-incident reviews and drive improvements to eliminate repeat failure modes.
Experience working with SCADA, OT, MES, or other industrial software systems is preferred.
Experience with disaster recovery (DR) playbooks, capacity modeling, and cost-performance optimization in hybrid environments.
Self-motivated, hands-on mindset, with a willingness to contribute at all levels.
Physical ability to perform essential job functions safely and successfully, including climbing, standing, stooping, or typing, consistent with ADA, FMLA, and other standards.
Ability to maintain regular, punctual attendance consistent with ADA, FMLA, and other standards.
Ability to work in challenging conditions which may include exposure to noise, dust, chemicals, and temperature extremes, while protected by PPE, for extended periods.
Availability to work occasional weekends and nights, or to be on call, as a regular part of the job.