Platform Support Architect

Data Direct Networks

$175K — $200K *
US-Anywhere; Remote in California, US
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in Linux-based infrastructure roles (SRE, MLOps, platform engineering, or L2/L3 support) supporting production systems; 8+ years total technical experience preferred.
  • Strong hands-on experience with containers and Kubernetes (Docker/containerd, Helm, Operators; debugging pods, etc.).
  • Demonstrated experience operating GPU-accelerated workloads in production (NVIDIA GPUs, drivers, CUDA concepts).
  • Practical experience with AI storage and networking for HPC/AI clusters (high-performance storage systems, RDMA-accelerated networking).
  • Experience with one or more vector databases (Milvus, Qdrant, etc.).
  • Solid understanding of RAG and Generative AI workflows.
  • Excellent communication skills, capable of clearly explaining complex AI topics.

Responsibilities

  • Act as primary NVIDIA AI Enterprise and vector database expert for HyperPOD customer environments.
  • Own complex end-to-end triage across GPU, NVAIE services, vector DB, Kubernetes, and networking.
  • Diagnose and resolve performance bottlenecks in RAG and agentic AI workflows.
  • Collect and interpret logs and telemetry across Linux, containers, GPUs, and storage.
  • Author and maintain support triage runbooks and checklists for HyperPOD.
  • Build hands-on labs and PoCs that mirror customer RAG and AI use cases on HyperPOD.
  • Provide structured feedback from field cases into Product Management and Engineering.

Benefits

  • Work in a cutting-edge AI environment with leading technology partners.
  • Opportunity to be a trusted technical advisor within the organization and with external partners.
  • Access to unique training and development opportunities in the AI sector.
  • Work remotely with a flexible work environment.
  • Contribute to innovative solutions that shape AI Data platform landscapes.
Full Job Description
Overview

DDN is expanding our Enterprise and Sovereign AI Solution offerings, for example HyperPOD, a turnkey NVIDIA AI Data Platform built on DDN Infinia storage, NVIDIA AI Enterprise (NVAIE), and Supermicro reference hardware, optimized for inference and RAG workloads. Our support organization is deep on storage (Infinia, EXAScaler); we are now hiring an AI platform specialist to lead supportability and enablement for the AI side of the stack: NVIDIA AI Enterprise services (NIMs, NeMo, Triton, GPU Operator, licensing), vector databases (initially Milvus), RAG/agentic workflows, and the high-performance storage and networking fabric that underpins them.

You will be a trusted technical advisor within Support and across OEM and NVIDIA partner teams, combining the mindset of a solutions architect (architecture, reference patterns, PoCs, reusable assets) with that of an L3 support engineer. You'll help DDN and our partners operate AI Data solutions as a cohesive AI platform, not just a collection of components.

Key Responsibilities
Platform support
  • Act as the primary NVIDIA AI Enterprise and vector database solutions expert for HyperPOD customer environments, bringing deep knowledge of NVAIE services (e.g., NIMs, NeMo, Triton, TensorRT/TensorRT-LLM, GPU Operator, licensing/NLS) and vector databases (e.g., Milvus) to guide diagnosis, optimization, and solution design.
  • Own complex end-to-end triage across GPU, NVAIE services, vector DB, Kubernetes, Docker, high-speed networking, and Infinia storage, distinguishing product defects from environmental and integration issues.
  • Diagnose and resolve performance bottlenecks in RAG and agentic AI workflows, from model selection and prompt/RAG configuration through to vector search, GPU utilization, and data access patterns.
  • Collect and interpret logs and telemetry across Linux, containers, Kubernetes, GPU stack, vector DB, and storage/networking; build minimal repros and high-quality defect reports for escalation to NVIDIA, vector DB vendors, OEMs, and internal engineering.
Runbooks, diagnostics, and supportability
  • Author and maintain support triage runbooks and checklists for HyperPOD covering NVAIE services, Milvus/vector DB, GPU stack, Docker, Kubernetes resources, and their interaction with Infinia and the network fabric.
  • Define and validate unified diagnostics bundles that capture the right logs/configs/metrics from all relevant layers (Infinia, GPUs, NVAIE, Milvus, Kubernetes, network) to enable fast problem isolation and high-signal escalations.
  • Collaborate with observability and tools teams to shape Prometheus/Grafana/ELK/NetQ or equivalent dashboards that surface both platform health and RAG/service-level metrics (e.g., TTFT, retrieval latency, error rates, throughput).
Enablement, PoCs, and reusable assets
  • Build hands-on labs and PoCs that mirror customer RAG and agentic AI use cases on HyperPOD, validating supportability and capturing known-good configurations and troubleshooting patterns.
  • Develop reusable technical assets (implementation guides, best-practice playbooks, tuning checklists, example architectures) to accelerate time-to-value for customers, PS, and Support.
Design feedback, readiness, and cross-functional leadership
  • Provide structured feedback from early field cases and PoCs into Product Management and Engineering on stack compatibility, upgrade order, rollback constraints, and observability needs for NVAIE, Milvus/cuVS, Infinia, and networking.
  • Collaborate closely with NVIDIA solutions architects, OEM architects, PS, and Support Innovation to align reference architectures and best practices with real-world support experience.
Required Experience & Skills
Technical
  • 5+ years in Linux-based infrastructure roles (SRE, MLOps, platform engineering, or L2/L3 support) supporting production systems; 8+ years total technical experience preferred.
  • Strong hands-on experience with containers and Kubernetes (Docker/containerd, Helm, Operators; debugging pods, DaemonSets, CSI, CNI, and ingress/load balancers).
  • Demonstrated experience operating GPU-accelerated workloads in production:
    • NVIDIA GPUs, drivers, CUDA concepts, GPU utilization/perf triage
    • NVIDIA GPU Operator and Kubernetes-based GPU lifecycle management
    • Familiarity with DGX / HGX or similar GPU cluster platforms.
  • Practical experience with AI storage and networking for HPC/AI clusters:
    • High-performance storage systems (e.g., EXAScaler/Lustre, GPFS, Ceph, distributed object storage, enterprise NAS/SAN).
    • RDMA-accelerated and/or high-speed Ethernet/InfiniBand networking, including fabrics, switch topologies, and large-scale deployments.
    • Hybrid cloud or cloud-adjacent patterns (Kubernetes CSI, cloud-native fabrics, data locality).
  • Experience with one or more vector databases (Milvus, Qdrant, Pinecone, pgVector, OpenSearch/Elasticsearch vectors, etc.), including schema design, ingestion, and operations.
  • Solid understanding of RAG and Generative AI workflows: embeddings, retrieval, reranking, prompt design, context management, and how these interplay with vector search and GPU inference at scale.
  • Familiarity with NVIDIA AI Enterprise components and toolchain, for example:
    • NVIDIA NIM inference microservices
    • NVIDIA NeMo framework / NeMo Retriever / NeMo Curator
    • Triton Inference Server, TensorRT / TensorRT-LLM, CUDA libraries
    • NVIDIA blueprints for enterprise RAG and agentic AI.
  • Experience designing, operating, or supporting MLOps / GenAI pipelines: CI/CD for models, deployment strategies, canarying/rollback, GPU resource management, monitoring and alerting for AI services.
  • Strong diagnostic skills across Linux, containers, Kubernetes, GPUs, storage, and networking; able to quickly narrow fault domains and propose experiments or configuration changes.
Support, architecture, and stakeholder skills
  • Track record of building reusable technical assets (runbooks, KBs, implementation guides, benchmarks, PoC templates) that improve support readiness and partner/customer success.
  • Excellent communication skills, capable of clearly explaining complex AI platform topics to both engineers and executive stakeholders, internally and with partners.
Preferred Qualifications
  • Prior experience with scale-out storage in GPU/AI environments.
  • Direct experience crafting and operating RDMA-accelerated HPC/AI clusters at scale, including spine-leaf or fat-tree network designs and large switch/router deployments.
  • Hands-on work with NVIDIA reference blueprints (Enterprise RAG, VSS, AIQ, industry-specific blueprints) or similar enterprise AI architectures.
  • Familiarity with AI observability and responsible AI practices (guardrails, monitoring for drift/toxicity, basic understanding of regulatory considerations like GDPR/HIPAA in the context of AI systems).
  • Experience with observability stacks (Prometheus, Grafana, Loki/ELK, NetQ, etc.) tuned for AI workloads, including service-level dashboards and SLOs.
What Success Looks Like in This Role

Within 6-12 months, a successful AI Data Platform Solutions Architect will have:

  • Become the go-to internal expert for how this AI and networking stack actually works in production across Support, PS, Product, and NPI for HyperPOD.
  • Driven speed and quality of support at the solution level for NVAIE, vector DB, and AI-workflow issues through high-quality diagnostics, architecture insight, and well-defined golden stack patterns.
  • Established clear, repeatable triage and escalation patterns for AI-side incidents that L1/L2 storage engineers can follow with confidence.

Salary Range for this role: $175,000 - $200,000

DDN

DDN has a very strong orientation towards these 4 characteristics and any successful employee will demonstrate these capabilities:

Self-Starter - Takes independent action to identify and solve problems. Seeks out relevant information needed to make decisions. Gets involved with new initiatives.

Success/Achievement Orientation - Delivers quality results consistently. Targets, achieves (or exceeds) measurable results. Sets challenging goals, focuses on critical priorities, and is accountable.

Problem Solving - Recognizes problems and responds with a systematic assessment that identifies and addresses cause of issue. Practical, realistic, and resourceful.

Innovative - Builds and improves key business processes that enhance the effectiveness of DDN. Generates new ideas, challenges the status quo, and solves problems creatively.

#LI-Remote
