Platform Support Engineer

Lightning AI

$115K – $140K *

San Francisco, CA 94112 (In-Person)

Information Technology

Less than 5 years of experience

1 week ago

Job Overview by Ladders

Qualifications

Strong software engineering and systems troubleshooting background
Experience with Kubernetes and containerized environments
Knowledge of Linux systems, including networking and performance tuning
Familiarity with cloud infrastructure and distributed systems
Hands-on experience with observability and debugging tools such as Prometheus or Grafana
Experience operating machine learning workloads in production
Strong communication skills to engage with technical customers

Responsibilities

Partner with customer engineering teams to support training and inference workloads
Help diagnose and resolve complex infrastructure issues
Act as a technical advisor during incidents and platform degradation
Translate infrastructure issues into actionable guidance for ML engineers
Investigate failures in distributed training and GPU orchestration
Analyze logs and metrics to isolate root causes
Drive long-term reliability improvements based on recurring patterns

Benefits

Comprehensive medical, dental, and vision coverage
Retirement and financial wellness support
Generous paid time off and holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

Full Job Description

What We're Looking For

We're looking for engineers who understand the realities of running machine learning workloads at scale.

This role sits at the intersection of ML systems, cloud infrastructure, Kubernetes, and customers. You'll support engineers training models, deploying inference systems, and scaling GPU workloads in production.

You are not a ticket router or a traditional support engineer. You are a technical partner to ML teams, helping them diagnose failures, improve reliability, and work through complex distributed systems problems.

The problems range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and platform reliability.

You'll gain exposure to a wide variety of real-world AI workloads across industries and help shape the infrastructure powering the next generation of ML applications.

What You'll Do

Work Directly With ML Engineers

Partner directly with customer engineering teams running training and inference workloads in production
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
Act as a technical advisor during high-impact incidents and platform degradation events
Translate infrastructure-level issues into actionable guidance for ML engineers
Build credibility with customers through strong technical reasoning and clear communication

Debug ML Infrastructure & Distributed Workloads

Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
Troubleshoot PyTorch, CUDA, NCCL, and inference-serving issues
Analyze logs, metrics, traces, and system behavior to isolate root causes
Debug containerized workloads running across Kubernetes and bare-metal GPU environments
Support customers scaling workloads across multi-node GPU systems
Diagnose performance bottlenecks involving compute, memory, networking, or storage

Improve Reliability & Platform Operations

Identify recurring patterns across customer issues and drive long-term reliability improvements
Contribute to post-incident reviews and operational improvements
Build internal tooling, automation, documentation, and runbooks
Partner closely with infrastructure, networking, and platform engineering teams
Help improve observability, operational visibility, and troubleshooting workflows
Improve the customer experience through better processes and technical guidance

What This Role Is Not

To set clear expectations:

This is not a traditional help desk or ticket routing support role
This is not purely customer success or account management
This is not a backend engineering role
This is not a passive escalation position

This role is for engineers who enjoy solving difficult technical problems while working closely with other engineers.

What You'll Need

Required Qualifications

Infrastructure & Systems

Strong software engineering and systems troubleshooting background
Experience with Kubernetes and containerized environments
Linux systems knowledge, including networking, storage, process management, and performance tuning
Experience with cloud infrastructure and distributed systems
Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry

ML Infrastructure Experience

Hands-on experience operating machine learning workloads in production or research environments
Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
Familiarity with GPU infrastructure and orchestration
Experience troubleshooting performance, reliability, or scaling issues in ML infrastructure
Understanding of the operational challenges involved in running ML systems at scale

Collaboration

Strong communication skills and ability to work directly with highly technical customers and engineering teams
Comfortable operating in fast-moving, highly ambiguous environments
Enjoys solving complex technical problems collaboratively

Nice-to-Haves

Experience with large-scale model training or distributed inference systems
Familiarity with Ray, Kubeflow, Slurm, or similar distributed scheduling platforms
Experience with InfiniBand, RDMA, or high-performance networking
Experience operating bare-metal infrastructure
Familiarity with storage systems commonly used in ML environments
Experience working at an AI infrastructure, cloud, MLOps, or developer tooling company
Contributions to platform engineering, developer infrastructure, or operational tooling projects
Experience writing automation, tooling, or scripts in Python or similar languages

This role is hybrid out of our Seattle or San Francisco offices, with an in-office requirement of at least 2 days per week and occasional team and company offsites. The role follows a Monday-Friday schedule, with working hours from 8:00 AM to 5:00 PM PST. We are not able to provide visa sponsorship for this role at this time.

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:

$115,000-$140,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

* Ladders Estimates
