• Design and implement scalable, production-grade pipelines for data ingestion, transformation, storage, and retrieval from vehicle fleets and simulation environments.
• Build internal tools and services for data labeling, curation, indexing, and cataloging across large and diverse datasets.
• Collaborate with ML researchers, autonomy engineers, and data scientists to design schemas and APIs that power model training, evaluation, and debugging.
• Develop and maintain feature stores, metadata systems, and versioning infrastructure for structured and unstructured data.
• Support the generation and integration of synthetic datasets with real-world logs to enable hybrid training and simulation workflows.
• Optimize pipelines for cost, latency, and traceability, ensuring reproducibility and consistency across environments.
• Partner with simulation and cloud platform teams to automate workflows for closed-loop testing, scenario mining, and performance analytics.
• Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
• 8+ years of experience building data-intensive software systems, ideally in robotics, autonomous driving, or large-scale ML environments.
• Proficient in Python and SQL; familiar with C++.
• Experience designing ETL pipelines using modern frameworks (e.g., Apache Spark, Flyte, Union).
• Strong knowledge of cloud-native architectures, including AWS services (e.g., S3) or equivalents (Google Cloud Platform).
• Familiarity with sensor data types (camera, lidar, radar, GPS/IMU) and common data serialization formats (e.g., protobuf, ROS 2 bag, MCAP).
• Deep understanding of data quality, observability, and lineage in high-volume systems.
• Track record of building reliable and performant infrastructure that supports both ad-hoc exploration and repeatable production workflows.
• Experience in AD/ADAS, robotics, or autonomous systems, especially handling perception or planning datasets.
• Familiarity with ML pipeline orchestration frameworks (e.g., Kubeflow, SageMaker).
• Experience working with temporal or spatial data, including geospatial indexing and time-series alignment.
• Exposure to synthetic data generation, simulation logging, or scenario replay pipelines.
• Strong software engineering fundamentals, including CI/CD, testing, code review, and service deployment best practices.
• Experience collaborating with cross-functional, distributed teams across research and production orgs.