Job Description
We are seeking a Senior Staff Engineer - LustreFS with 15+ years of experience in distributed storage, the Linux kernel and large-scale HPC/AI infrastructure. This role is intended for a deeply hands-on technical leader who can independently drive architecture, debugging, reliability and performance across LustreFS subsystems including metadata, object storage, recovery, LNet and high-performance transports such as RDMA/InfiniBand/RoCE. You will be expected to mentor senior engineers, shape technical direction, improve operational resilience, and help convert tribal knowledge into scalable engineering systems. Fluency with AI tools for faster triage, debugging, design exploration and knowledge capture is a strong advantage.
Key Responsibilities
- Provide deep technical leadership across LustreFS subsystems including llite, MDS/MDT, OSS/OST, LDLM, recovery and LNet.
- Own complex root-cause analysis for difficult customer, scale and production issues across kernel, filesystem, network and transport layers.
- Lead design and implementation of new features, reliability improvements, scale enhancements and performance optimizations in LustreFS.
- Drive architectural reviews for kernel-space and user-space changes with strong attention to correctness, backward compatibility and operability.
- Define debugging and observability strategies for complex distributed failure scenarios including failover, recovery storms, lock contention and transport degradation.
- Partner with principal engineers, support, QE, DevOps and release teams to improve product quality, test depth and release confidence.
- Mentor senior and mid-level engineers; create structured learning paths, review standards and subsystem ownership models to build redundancy.
- Promote use of AI-assisted workflows for issue triage, log analysis, code review assistance, knowledge capture and design acceleration with appropriate engineering guardrails.
Required Qualifications
- 15+ years of experience in distributed systems, filesystems, Linux kernel development or storage infrastructure engineering.
- Strong hands-on expertise in LustreFS internals and production operations, including one or more of: metadata services, object storage services, client/llite, locking, recovery or LNet.
- Strong C systems programming skills and deep Linux debugging experience using tools such as gdb, crash, perf, ftrace, eBPF, systemtap and core analysis.
- Strong understanding of Linux kernel concurrency, memory management, I/O paths, networking and performance tuning.
- Experience with high-performance networking and transports such as InfiniBand, RDMA, RoCE and/or TCP at scale.
- Proven ability to diagnose complex cross-layer issues spanning kernel, storage, networking and distributed coordination.
- Experience leading design discussions, code reviews and subsystem-level technical decisions.
- Excellent written and verbal communication skills with the ability to guide senior technical audiences and influence cross-functional teams.
Preferred Skills
- Experience with large-scale AI/HPC clusters, parallel filesystems and performance-sensitive production environments.
- Familiarity with backend storage filesystems and media such as ZFS, ldiskfs, NVMe and enterprise storage platforms.
- Experience with upstream/open-source contribution models, patch review and long-term maintenance / backporting.
- Experience building runbooks, failure-injection tests, automated diagnostics or observability pipelines for distributed storage.
- Practical use of AI tools for log summarization, issue triage, code review augmentation, design exploration and knowledge-base generation.
- Track record of growing engineering capability through mentoring, documentation and systematic knowledge transfer.
What You Will Work On
- Complex customer escalations and deep production issues involving failover, recovery, locking, performance regression and transport instability.
- Architecture and implementation of new LustreFS capabilities or subsystem enhancements for scale, resilience and serviceability.
- Cross-layer debugging across llite, kernel, MDS/OSS, LNet and RDMA/InfiniBand environments.
- Technical reviews of code changes, design proposals and release-readiness for critical fixes and long-lived branches.
- Building AI-enabled engineering workflows that accelerate triage, debugging, code reviews, design iteration and structured knowledge capture.
- Developing the next generation of Lustre engineers through mentorship, playbooks, design walkthroughs and repeatable debugging frameworks.
Why This Role Matters
This role is central to building durable engineering redundancy in LustreFS: expanding deep subsystem ownership, reducing concentration risk, and accelerating next-generation delivery through strong engineering fundamentals and AI-enabled execution.
Salary Range for this role: $215,000 - $265,000
Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:
- Coding assessment: Often in a language of your choice.
- Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
- Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
- Meet and greet with the wider team.
Our goal is to finish the main process in 2-3 weeks at most.
#LI-Remote