A Software Engineer on the AI Applications team, responsible for designing and building scalable, reliable AI infrastructure, pipelines, and automation tools that support high-velocity AI experimentation and deployment in a cloud environment.
Key Responsibilities
Architect platform self-service capabilities with internal APIs and abstractions for AI environment provisioning
Develop internal tools and AI agents to automate infrastructure root-cause analysis and system optimization
Design and implement asynchronous, event-driven AI data pipelines using Kafka or RabbitMQ
Standardize AI deployment processes using Docker and Kubernetes to ensure reproducibility across environments
Implement advanced monitoring and observability for AI-specific metrics to ensure system reliability under heavy loads
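The asynchronous, event-driven pipelines described above can be sketched in Go. This is a minimal, illustrative example only: in-memory channels stand in for a Kafka or RabbitMQ broker, and the `Event` type and `RunPipeline` function are hypothetical names, not part of any actual internal API.

```go
package main

import (
	"fmt"
	"sync"
)

// Event stands in for a message consumed from a Kafka topic or RabbitMQ queue.
type Event struct {
	ID      int
	Payload string
}

// RunPipeline fans events out to nWorkers concurrent consumers and returns
// the number of events processed. In a real pipeline, the channel would be
// replaced by a broker consumer and the handler would enrich, embed, or
// persist each payload.
func RunPipeline(events []Event, nWorkers int) int {
	in := make(chan Event)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range in {
				_ = ev // placeholder for the real per-event handler
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	for _, ev := range events {
		in <- ev
	}
	close(in) // signals consumers that the stream is drained
	wg.Wait()
	return processed
}

func main() {
	events := make([]Event, 100)
	for i := range events {
		events[i] = Event{ID: i, Payload: fmt.Sprintf("doc-%d", i)}
	}
	fmt.Println(RunPipeline(events, 8)) // prints 100
}
```

The fan-out pattern shown here (one producer, N goroutine consumers, a shared channel) is the same shape a Kafka consumer group takes at larger scale.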
Requirements
Extensive experience in Platform Engineering with a focus on writing clean, testable Go and Python code to manage complex cloud environments and GPU workloads.
Practical experience integrating LLM APIs, managing Vector Databases, and optimizing Retrieval-Augmented Generation (RAG) pipelines and distributed caches such as Redis.
Deep understanding of Go's goroutines and channels to handle the massive data throughput required by Kafka-driven AI pipelines.
Proven ability to manage Kubernetes (K8s) at scale, specifically extending the K8s API with custom controllers to make clusters AI-aware and managing specialized NVMe storage.
Sound judgment in asynchronous system design, including when to use Kafka for high-throughput streaming versus RabbitMQ for complex task routing.
Ability to create internal APIs and Go-based abstractions that enable engineers to provision AI-ready environments including model weights, vector databases, and event streams with a single command.
Experience developing internal tools and AI Agents using Go and Python to automate root-cause analysis of infrastructure failures and proactively optimize system performance.
Experience designing and implementing asynchronous processing pipelines using Kafka or RabbitMQ to manage high-volume data ingestion for RAG systems and real-time model demand.
Experience standardizing AI deployments using Docker and Kubernetes so that models, prompts, and code stay in sync across all environments.
Experience implementing advanced monitoring with Prometheus and Grafana to track AI-specific metrics such as token latency and model drift, maintaining reliability under heavy inference loads.
Willingness to work from the Bengaluru office in accordance with company policies, unless on PTO, work travel, or other approved leave.
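The "single command" provisioning abstraction mentioned in the requirements can be sketched as a Go function that runs independent provisioning steps concurrently. All names here (`Provisioner`, `ProvisionEnv`, the stub provisioners) are hypothetical stand-ins, not an actual internal API; real steps would call cloud and cluster APIs and return errors.

```go
package main

import (
	"fmt"
	"sync"
)

// Provisioner is a hypothetical step that prepares one component of an
// AI-ready environment (model weights, vector database, event streams).
type Provisioner func(env string) string

// Illustrative stubs; real implementations would call infrastructure APIs.
func provisionWeights(env string) string  { return "weights:" + env }
func provisionVectorDB(env string) string { return "vectordb:" + env }
func provisionStreams(env string) string  { return "streams:" + env }

// ProvisionEnv runs all steps concurrently and collects their results,
// giving engineers a single entry point for environment setup.
func ProvisionEnv(env string, steps []Provisioner) []string {
	results := make([]string, len(steps))
	var wg sync.WaitGroup
	for i, step := range steps {
		wg.Add(1)
		go func(i int, step Provisioner) {
			defer wg.Done()
			results[i] = step(env) // each slot is written by exactly one goroutine
		}(i, step)
	}
	wg.Wait()
	return results
}

func main() {
	out := ProvisionEnv("staging", []Provisioner{
		provisionWeights, provisionVectorDB, provisionStreams,
	})
	fmt.Println(out) // prints [weights:staging vectordb:staging streams:staging]
}
```

Exposing this behind an internal API or CLI is what turns multi-step environment setup into the single command the role describes.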
Benefits & Perks
Competitive salary (range not specified)
Flexible time off
Wellness resources
Company-sponsored team events
Support for growth and development
Inclusive and diverse work environment
Ready to Apply?
Join Pure Storage and make an impact in data storage and AI infrastructure