Job Description
We are seeking a Staff Engineer – LustreFS with 10+ years of experience in distributed storage and Linux-based systems engineering. This is a hands-on senior technical role focused on design, debugging, performance, and operational excellence across LustreFS and adjacent stack components. The ideal candidate brings strong expertise in one or more Lustre subsystems, can independently drive complex investigations, and collaborates effectively across engineering, QE, support and release teams. Engineers who are comfortable using AI to accelerate triage, debugging, code comprehension and new feature design will be especially valuable.
Key Responsibilities
- Design, develop and debug LustreFS features, fixes and enhancements across relevant subsystems such as llite, MDS/MDT, OSS/OST, LDLM and LNet.
- Investigate customer and scale-related defects, drive root-cause analysis and implement high-quality fixes with strong attention to correctness and maintainability.
- Contribute to performance tuning, failure analysis and reliability improvements for large-scale Lustre deployments.
- Participate actively in code reviews, design reviews and subsystem discussions, bringing rigor to testing and operational readiness.
- Work closely with QE and support to reproduce issues, improve diagnostic data quality and increase coverage for high-risk failure scenarios.
- Help document subsystem behavior, debugging approaches, known failure patterns and operational best practices.
- Use AI-assisted tools where appropriate to speed up issue triage, summarize logs, improve code understanding and capture reusable lessons learned.
Required Qualifications
- 10+ years of experience in systems software, distributed systems, storage, Linux kernel or filesystem engineering.
- Strong experience in LustreFS development, support or performance engineering with depth in at least one major subsystem.
- Strong C programming and Linux systems debugging skills.
- Working knowledge of Linux kernel internals, filesystem semantics, networking and performance analysis.
- Experience with LNet and/or high-performance transports such as RDMA, InfiniBand, RoCE or TCP-based storage networking.
- Ability to debug and resolve issues spanning multiple layers including client, server, network and backend storage.
- Strong collaboration skills and the ability to work across functions in a fast-moving engineering environment.
Preferred Skills
- Experience in HPC, AI infrastructure or large-scale parallel storage environments.
- Exposure to metadata-heavy and throughput-heavy workload characterization and tuning.
- Familiarity with ZFS, ldiskfs, NVMe-backed storage and related observability / performance tooling.
- Experience creating test plans, reproducer frameworks, runbooks or diagnostic automation.
- Comfort using AI tools to accelerate debugging, code reviews, triage, documentation and early-stage design ideation.
- Experience mentoring junior engineers or leading focused technical efforts within a subsystem.
What You Will Work On
- Hands-on development and debugging of LustreFS defects, performance issues and subsystem enhancements.
- Customer-facing and scale-related issue investigation across llite, metadata, object storage, LNet and transport layers.
- Collaborative design and implementation of reliability, observability and serviceability improvements.
- Reviewing and validating fixes through targeted tests, failure injection, log analysis and performance characterization.
- Using AI-assisted workflows to accelerate triage, debug loops, code understanding and documentation quality.
- Contributing to team redundancy by strengthening documentation, code review quality and subsystem knowledge sharing.
Why This Role Matters
This role is central to building durable engineering redundancy in LustreFS: expanding deep subsystem ownership, reducing concentration risk, and accelerating next-generation delivery through strong engineering fundamentals and AI-enabled execution.
Salary Range for this role: $185,000 - $230,000
Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, in both work ethic and deliverables, which makes strong prioritization skills essential. We also value strong communication skills in all our engineers and researchers, as they are crucial to the success of our teams and the company as a whole.
Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:
- Coding assessment: Often in a language of your choice.
- Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
- Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
- Meet and greet with the wider team.
Our goal is to finish the main process in 2-3 weeks at most.
#LI-Remote