Senior Cloud Infrastructure Engineer

Gatik
Mountain View, CA
Full Time
Posted May 27, 2026
$180k - $240k
Power Generation

Job Description

At Gatik, we connect people of extraordinary talent and experience to an opportunity to create a more resilient supply chain and contribute to our environment’s sustainability.

Key Responsibilities

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective. This role is onsite 5 days a week at our Mountain View, CA office!

• Cloud-Native Orchestration & Kubernetes
  • Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
  • GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
  • Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
  • Self-Healing Infrastructure: Deploy autonomous AI agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.

• Data Engineering & CI/CD Pipelines
  • Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
  • GitOps: Implement robust GitOps workflows using Argo CD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
  • Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
  • Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.

• Model Management & Lifecycle (MLOps)
  • Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
  • Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
  • High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.

• Distributed Training & ML Systems Support
  • Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
  • Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
  • Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.

Requirements

• Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
• Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
• Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
• Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
• Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.
• Distributed Training Expertise: Deep understanding of FSDP and DeepSpeed.
• AI Agent Orchestration: Experience building agentic workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
• Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.

Salary Range: $180,000 - $240,000

