The role involves optimizing machine learning infrastructure for self-driving technology, focusing on improving training efficiency, input pipelines, and system performance to accelerate autonomous vehicle development.
Key Responsibilities
Collaborate with ML practitioners and infrastructure teams to integrate optimized input pipelines into workflows
Detect, diagnose, and resolve performance bottlenecks in training, evaluation, and model distillation workflows
Optimize training performance and resource utilization, and ensure reproducible model training outcomes
Enhance input data pipelines to increase runtime efficiency and maximize accelerator utilization
Champion best practices for robust, reproducible, and debuggable ML experimentation
Requirements
A B.S., M.S., or Ph.D. in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
At least 4 years of professional experience in ML infrastructure, distributed training, or ML systems engineering, scaling models on multi-node, multi-accelerator clusters.
Understanding of training, evaluation, and distillation workflows for billion-parameter models.
Expert-level knowledge of distributed systems and Python.
Strong skills in profiling, debugging, and optimizing quantized workloads.
Experience with ML compilers and strategies to reduce startup overhead.
Familiarity with model distillation and efficient inference workflows.
Hands-on experience with Foundation Model infrastructure.
Highly proficient in C++, distributed systems, and the internals of distributed training libraries such as NCCL, Horovod, DeepSpeed, or Ray.
Benefits & Perks
Salary range of $235,030 to $352,290, depending on experience and qualifications