Sr. Staff Engineer, Lustre

Data Direct Networks

$215K - $265K *

San Francisco, CA 94112 (In-Person)

Information Technology

11 - 15 years of experience

Today

Job Overview by Ladders

Qualifications

15+ years in distributed systems, filesystems, or Linux kernel development.
Hands-on expertise in LustreFS internals and production operations.
Strong C programming skills and advanced Linux debugging proficiency.
Deep understanding of Linux kernel concurrency and performance tuning.
Experience with high-performance networking (InfiniBand, RDMA, RoCE).
Ability to diagnose complex, cross-layer issues.
Proven leadership in design discussions and technical decisions.

Responsibilities

Lead technical direction across LustreFS subsystems including metadata and object storage.
Conduct root-cause analysis for challenging production issues.
Design and implement new features and reliability improvements in LustreFS.
Oversee architectural reviews for significant changes with a focus on operability.
Develop strategies for debugging complex distributed failure scenarios.
Collaborate with various engineering teams to enhance product quality and testing.
Mentor and upskill fellow engineers with structured learning and standards.

Benefits

Opportunity to work in a cutting-edge AI and HPC environment.
Lead and shape new engineering practices and workflows.
Direct impact on reducing engineering risk and improving resilience.
Potential for professional growth through mentorship opportunities.
Participate in open-source contributions and community engagement efforts.

Full Job Description

We are seeking a Senior Staff Engineer - LustreFS with 15+ years of experience in distributed storage, the Linux kernel and large-scale HPC/AI infrastructure. This role is intended for a deeply hands-on technical leader who can independently drive architecture, debugging, reliability and performance across LustreFS subsystems including metadata, object storage, recovery, LNet and high-performance transports such as RDMA/InfiniBand/RoCE. You will be expected to mentor senior engineers, shape technical direction, improve operational resilience, and help convert tribal knowledge into scalable engineering systems. Fluency with AI-assisted tools for faster triage, debugging, design exploration and knowledge capture is a strong advantage.

Key Responsibilities

Provide deep technical leadership across LustreFS subsystems including llite, MDS/MDT, OSS/OST, LDLM, recovery and LNet.
Own complex root-cause analysis for difficult customer, scale and production issues across kernel, filesystem, network and transport layers.
Lead design and implementation of new features, reliability improvements, scale enhancements and performance optimizations in LustreFS.
Drive architectural reviews for kernel-space and user-space changes with strong attention to correctness, backward compatibility and operability.
Define debugging and observability strategies for complex distributed failure scenarios including failover, recovery storms, lock contention and transport degradation.
Partner with principal engineers, support, QE, DevOps and release teams to improve product quality, test depth and release confidence.
Mentor senior and mid-level engineers; create structured learning paths, review standards and subsystem ownership models to build redundancy.
Promote use of AI-assisted workflows for issue triage, log analysis, code review assistance, knowledge capture and design acceleration with appropriate engineering guardrails.

Required Qualifications

15+ years of experience in distributed systems, filesystems, Linux kernel development or storage infrastructure engineering.
Strong hands-on expertise in LustreFS internals and production operations, including one or more of: metadata services, object storage services, client/llite, locking, recovery or LNet.
Strong C systems programming skills and deep Linux debugging experience using tools such as gdb, crash, perf, ftrace, eBPF, systemtap and core analysis.
Strong understanding of Linux kernel concurrency, memory management, I/O paths, networking and performance tuning.
Experience with high-performance networking and transports such as InfiniBand, RDMA, RoCE and/or TCP at scale.
Proven ability to diagnose complex cross-layer issues spanning kernel, storage, networking and distributed coordination.
Experience leading design discussions, code reviews and subsystem-level technical decisions.
Excellent written and verbal communication skills with the ability to guide senior technical audiences and influence cross-functional teams.

Preferred Skills

Experience with large-scale AI/HPC clusters, parallel filesystems and performance-sensitive production environments.
Familiarity with backend storage filesystems and media such as ZFS, ldiskfs, NVMe and enterprise storage platforms.
Experience with upstream/open-source contribution models, patch review and long-term maintenance / backporting.
Experience building runbooks, failure-injection tests, automated diagnostics or observability pipelines for distributed storage.
Practical use of AI tools for log summarization, issue triage, code review augmentation, design exploration and knowledge-base generation.
Track record of growing engineering capability through mentoring, documentation and systematic knowledge transfer.

What You Will Work On

Complex customer escalations and deep production issues involving failover, recovery, locking, performance regression and transport instability.
Architecture and implementation of new LustreFS capabilities or subsystem enhancements for scale, resilience and serviceability.
Cross-layer debugging across llite, kernel, MDS/OSS, LNet and RDMA/InfiniBand environments.
Technical reviews of code changes, design proposals and release-readiness for critical fixes and long-lived branches.
Building AI-enabled engineering workflows that accelerate triage, debugging, code reviews, design iteration and structured knowledge capture.
Developing the next generation of Lustre engineers through mentorship, playbooks, design walkthroughs and repeatable debugging frameworks.

Why This Role Matters

This role is central to building durable engineering redundancy in LustreFS: expanding deep subsystem ownership, reducing concentration risk, and accelerating next-generation delivery through strong engineering fundamentals and AI-enabled execution.

Salary Range for this role: $215,000 - $265,000

Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

Coding assessment: Often in a language of your choice.
Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

* Ladders Estimates
