Senior AI Infrastructure Engineer

Gatik
Mountain View, California
Full Time
Posted March 18, 2026
$180k - $240k

Job Description

The Senior AI Infrastructure Engineer at Gatik designs, builds, and scales high-performance AI platforms that support autonomous driving models. The role focuses on distributed training, model deployment, infrastructure automation, and system monitoring to enable efficient and reliable autonomous vehicle operations.

Key Responsibilities

  • Design, build, and scale high-performance AI infrastructure for autonomous driving models
  • Support distributed training and experiment tracking for complex ML models
  • Optimize multi-GPU setups and low-level networking communication for large-scale training
  • Deploy and scale optimized AI models using inference engines like TensorRT, ONNX Runtime, and Triton
  • Develop and implement autonomous AI agents for infrastructure monitoring and hardware failure triage
  • Create automation tools for infrastructure management, model lifecycle, and data curation
  • Maintain and develop ML infrastructure leveraging MLflow, Argo Workflows, and Kubernetes
  • Scale and manage large data pipelines and ETL processes using Apache Airflow, Kafka, and Spark
  • Implement monitoring and observability systems to track ML system metrics and infrastructure health

Requirements

  • Five or more years of experience in ML infrastructure, MLOps, or DevOps supporting high-scale compute environments.
  • Deep understanding of multi-GPU training strategies and frameworks such as Fully Sharded Data Parallel (FSDP), DeepSpeed, and Ray Train, and of high-performance networking technologies including NCCL and InfiniBand.
  • Mastery of Kubernetes, Terraform, and Helm, with a focus on GPU-native orchestration and infrastructure automation.
  • Proven experience building or supporting agentic workflows for infrastructure or data automation, e.g., using large language models (LLMs) to drive DevOps tasks.
  • Expertise in MLflow, Argo Workflows, and Kubernetes for platform management.
  • Strong experience with Docker, Kubernetes, and Helm for containerization.
  • Proficiency in Apache Airflow, Kafka, Spark, and GitOps automation for data pipelines and CI/CD processes.
  • Proficiency in Python and Bash scripting; experience with Go or Rust is a plus.
  • Experience supporting or building autonomous AI agents with frameworks such as LangGraph, CrewAI, or AutoGen for monitoring GPU cluster health and automating hardware failure triage.
  • Experience designing, building, and scaling high-performance AI platforms that enable distributed training, experiment tracking, and model deployment.
  • Ability to optimize low-level communication libraries and interconnects such as NCCL, InfiniBand, or RoCE v2 to minimize latency for large-scale training.
  • Ability to architect and optimize multi-GPU setups ensuring efficient model and data parallelism across clusters with H100 and A100 GPUs.
  • Experience deploying and scaling models using TensorRT, ONNX Runtime, and Triton Inference Server, including fine-tuning pipelines for real-time and batch inference.
  • Ability to develop agent-driven automation for infrastructure and data management, including model lifecycle automation and experiment tracking.
  • Experience in scaling ETL pipelines using Apache Airflow, Kafka, and Spark, and collaborating with data engineering teams to manage large datasets in storage solutions like S3, GCS, or Delta Lake.
  • Ability to define and track key ML system metrics such as training convergence, latency, throughput, and drift, and maintain deep visibility into platform health using Prometheus, Grafana, OpenTelemetry, and the ELK Stack.
  • Experience supporting or building systems for monitoring AI-specific KPIs including model latency, inference throughput, and feature drift detection.
  • Bachelor’s degree in Computer Science, Engineering, or a related field (implied by the level of expertise required).

Benefits & Perks

  • Salary range of $180,000 to $240,000
  • On-site work schedule, 5 days a week
  • Collaborative and inclusive work culture
  • Support for professional growth and development in ML infrastructure, MLOps, and DevOps
  • Opportunities to work on cutting-edge autonomous vehicle technology
  • Projects with industry-leading companies such as Walmart
  • Access to advanced hardware and software tools for AI and infrastructure development
