Infrastructure Operations Engineer

Lightning AI

• $160K — $200K *

New York, NY 10025 (In-Person)

Enterprise Technology

8 - 10 years of experience

1 week ago

Job Overview by Ladders

Qualifications

8+ years of experience managing Linux systems; Ubuntu experience is a plus.
5+ years working with AWS.
2+ years using Kubernetes with solid knowledge of container technologies.
2+ years utilizing Terraform and Ansible for automation tasks.
Proficient in network-attached storage management; familiarity with VAST systems is an advantage.
Experienced with monitoring solutions like Prometheus or ELK stack.
Understanding of GitOps workflows.

Responsibilities

Design and deploy new platforms to reduce incidents and support features.
Implement updates for both internal needs and end customer requirements.
Work collaboratively with other teams including Engineering and Customer Success.
Share on-call duties with team members in a rotating structure.

Benefits

Comprehensive medical, dental and vision coverage for U.S. employees; private insurance for U.K. employees.
Retirement support in the U.S. and pension contributions in the U.K.
Generous paid time off including holidays.
Paid parental leave for new parents.
Support for professional development.
Stipends for wellness and remote work expenses.
Flexible work arrangements offered.

Full Job Description

What We're Looking For

Lightning AI is seeking an experienced Infrastructure Operations Engineer to help scale and operate our next-generation AI infrastructure platform. Our InfraOps team sits at the center of reliability, automation, and operational scale for GPU infrastructure. This team owns break/fix operations, incident response, customer provisioning, observability, and the automation systems that keep complex infrastructure running efficiently.

In this role, you'll work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability. You'll partner closely with Infrastructure Engineering, Network Operations, and Software Platform teams to troubleshoot issues, improve operational efficiency, and build automation that reduces manual toil over time.

We're flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

What You'll Do

At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.
Deploy updates and improvements to support both Voltage Park's internal and end customer use cases.
Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development teams.
Participate in the on-call rotation, which is evenly distributed across all team members in a primary/secondary pattern: you serve as primary, then move to the secondary position.

What You Will Need
Required Qualifications

8+ years working with Linux as a server/hosting platform; extra points for Ubuntu experience.
5+ years of experience with AWS.
2+ years of experience with Kubernetes and strong container fundamentals.
2+ years of experience with Terraform and Ansible.
2+ years of network-attached storage management (via NFS, Ceph, or other protocols). Extra points for experience with VAST storage systems.
Experience with monitoring systems (Prometheus, ELK stack).
Familiarity with GitOps workflows.
Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs.
Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.
Experience building and delivering complex systems.
Effective at navigating tradeoffs between design, risk, cost, and outcomes.
Comfortable with navigating ambiguity.
Strong written and oral communication.

Nice-to-Haves

Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware.
Experience with GPU servers, either in bare metal form or under virtualization.
Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks equipment.
Experience with VAST storage systems.

Compensation

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:

$160,000-$200,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

Comprehensive medical, dental and vision coverage (U.S.); private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

* Ladders Estimates
