The role involves designing, building, and maintaining large-scale machine learning infrastructure to support safety and operational AI features across various industries, with a focus on scalability, reliability, and cross-team collaboration.
Key Responsibilities
Design, build, and operate the end-to-end machine learning platform for training, experimentation, inference, and edge deployment.
Partner with product and ML teams to develop and improve ML-powered features that enhance safety, reliability, and cost efficiency.
Lead capacity planning and throughput estimation for new ML features from exploration to production.
Collaborate on experiment design, evaluation, and interpretation to inform product and technical decisions.
Evolve and standardize infrastructure for training, experimentation, and deployment workflows.
Design and operate scalable online and batch inference systems, ensuring observability and reliability.
Work with firmware and edge teams to package, validate, and deploy models to devices, establishing feedback loops for continuous improvement.
Ensure the reliability, security, and observability of ML systems across cloud and edge environments.
Provide technical leadership by setting architecture strategies, influencing cross-team decisions, and mentoring engineers and scientists.
Requirements
10 years of overall experience in machine learning engineering or related fields, with a strong track record of building and operating large-scale ML systems
Strong experience with distributed computing frameworks such as Ray and/or Spark
Hands-on experience with cloud infrastructure such as AWS, container orchestration such as Kubernetes, and production observability tooling
Proven experience building or supporting ML platforms used by multiple teams for training, experimentation, or inference
Solid understanding of ML fundamentals including evaluation, experiment design, and model iteration in production environments
Experience shipping ML-powered features end-to-end, from design through production and iteration, with measurable impact on product or business metrics
Background in computer vision and/or LLM-based systems in production environments
Experience with edge or on-device ML and collaboration with firmware or embedded teams
Familiarity with model lifecycle systems including model registry, deployment, monitoring, rollback, and drift detection
Experience working in environments with strong security and compliance requirements
Demonstrated ability to lead across teams and influence technical direction at Staff scope
A strong sense of ownership and a desire for end-to-end autonomy from platform design to real-world impact
Benefits & Perks
Annual base salary: 196,000 to 269,500 CAD
Performance-based variable bonus
Equity with no vesting cliff and ongoing refresh opportunities
Flexible, employee-led remote work model
Comprehensive health and parental leave plans
Professional development stipend
Ready to Apply?
Join Samsara and make an impact in renewable energy