Amazon

Sr. System Development Engineer, AI/ML/Storage server team

Amazon: $173K - $235K
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 6+ years of professional software development experience
  • 6+ years in systems design, software development, operations, and automation
  • 6+ years in designing or architecting systems with focus on reliability
  • 5+ years of programming experience in a modern language like C++, Java, or Python
  • Experience with Linux/Unix systems
  • Proven track record in deploying complex and scalable software solutions

Responsibilities

  • Lead the development of automation infrastructure for server health
  • Design predictive failure detection systems using telemetry and log analysis
  • Build zero-touch operations for automatic fault management
  • Develop monitoring tools and dashboards for real-time fleet insights
  • Debug and resolve complex system issues across multiple platforms
  • Build diagnostic tools for automated root cause analysis
  • Collaborate cross-functionally to enhance server hardware and software solutions

Benefits

  • Comprehensive health insurance coverage
  • 401(k) matching program
  • Paid time off
  • Parental leave benefits
  • Flexible spending accounts
  • Adoption and surrogacy reimbursement

Full Job Description

Application deadline: May 26, 2026

We are seeking an experienced Senior Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy - with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components - leading delivery yourself and through others in parallel - using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations - driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure

- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms

- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact

- Drive toward zero-touch operations - building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention

- Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments

- Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)

Debugging & Troubleshooting

- Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments

- Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems

- Perform root cause analysis on hardware failures - correlating across firmware, kernel, driver, and physical layer to isolate faults

- Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage

- Improve manufacturing throughput and yield through test optimization

Systems Development & Automation

- Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress

- Design and build scalable system-level software with focus on durability, availability, security, and diagnostics

- Develop and maintain device drivers for Linux on ARM and x86 architectures

- Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)

- Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments

- Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems

Cross-Team Collaboration

- Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams

- Work closely with internal customers to identify potential problems early when onboarding new servers - storage or accelerated compute - into their ecosystem

- Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development

- Contribute to server design to improve robustness, testability, diagnosability, and reliability

- Partner with datacenter operations teams to close the loop between field failures and design improvements

A day in the life

Systems Development Engineers in AWS Hardware Engineering wear many hats. From orchestration tooling development to hardware integration to kernel driver debugging, we dive deep into problems across the breadth of AWS. Our teams are directly responsible for launching and maintaining server hardware in the fleet - including storage servers powering distributed storage platforms and AI/ML accelerator servers with GPUs. Located in Seattle and Cupertino, we work with internal development teams, ODMs, and design partners to deliver servers deployed in datacenters worldwide.

BASIC QUALIFICATIONS

- 6+ years of non-internship professional software development experience

- 6+ years of systems design, software development, operations, automation, and process improvement experience

- 6+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)

- 5+ years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby

- Experience with Linux/Unix

- Experience leading the design, build and deployment of complex and performant (reliable and scalable) software solutions in production

PREFERRED QUALIFICATIONS

- Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations

- Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers

- Experience building predictive failure detection or proactive remediation systems at fleet scale

- Experience with Linux kernel driver development

- Experience with storage, compute, and GPU/accelerator platforms (NVIDIA), including driver integration, diagnostics, or performance validation

- Experience with distributed storage systems (block, object, or file)

- Familiarity with server hardware architecture, BMC/IPMI, firmware, PCIe topology, NVLink, and hardware diagnostics

- Experience working with ODMs or hardware design partners through the product development lifecycle

- Experience building zero-touch or self-healing automation for large-scale infrastructure

- Experience working in large-scale datacenter or cloud environments

- Track record of rapidly coming up to speed on new engineering disciplines and making impactful decisions

- Experience with hardware bring-up, validation, and fleet-wide deployment

- Familiarity with telemetry pipelines, anomaly detection, and operational metrics at scale

- Familiarity with manufacturing workflows and yield improvement optimization

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, CA, Cupertino - 173,900.00 - 235,200.00 USD annually

USA, CO, Denver - 151,200.00 - 204,600.00 USD annually

USA, WA, Seattle - 151,200.00 - 204,600.00 USD annually

About Amazon

Audible, an Amazon subsidiary, is a provider of spoken audio information and entertainment on the Internet. It provides premium spoken audio content, such as audio versions of books, newspapers, and radio programs, that is delivered over the Internet and played back on personal computers and hand-held electronic devices. The Audible service allows consumers to purchase and download content from its website, store it in digital files, and play it back on personal computers and electronic devices. More than 15,000 hours of audio content are available on its website, including audio versions of books, periodicals, and radio programs. Several manufacturers have agreed to support and promote the playback of Audible content on their hand-held audio-enabled electronic devices.

Amazon Careers

Joining Amazon presents an unparalleled opportunity to become part of a vibrant team pushing the boundaries of innovation and growth in the global marketplace. As a leader in e-commerce, technology, and logistics, Amazon offers a variety of job opportunities that cater to a range of skills and professional interests.

Work You’ll Do

At Amazon, every day is an opportunity to collaborate with the brightest minds in technology and business to redefine what’s possible. Whether you’re interested in software development, marketing, human resources, or customer service, Amazon has a position waiting for you. Transform the way the world shops and innovates with our diverse and inclusive team. Amazon is not just a company; it’s a community where you can drive real change and contribute to projects impacting millions globally.

Lead with Innovation and Leadership

Amazon is the perfect place to enhance your leadership and innovation skills. Our culture encourages pushing the envelope and imagining the unimaginable. Here, you will lead projects that challenge the status quo and define new industry standards. Work with a team that values diversity and is committed to creating an inclusive environment. Our leadership is focused on harnessing the collective power of unique perspectives to foster growth and innovation.

Explore Amazon’s Employment Benefits

Amazon’s commitment to its employees extends beyond just career growth. We offer competitive benefits, including health care, parental leave, and diversity training, ensuring that our team not only excels professionally but also enjoys well-being and security.

Internship and Networking Opportunities

Start your career with an Amazon internship and gain hands-on experience that matters. Our internships provide a gateway to full-time employment and an opportunity to network with professionals across various sectors of the company.

Future-Proof Your Career

With Amazon, your career path is filled with numerous opportunities for advancement. Our learning and development programs are designed to nurture your professional growth and keep you at the forefront of industry trends.

Stay Connected

Join Our Team

Discover the job opportunities at Amazon that match your skills and interests. We are constantly on the lookout for passionate, curious, and innovative team players ready to make a difference.

Keep Up to Date

Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today, all from the people who work here.

Job Alert Emails

Customize your subscription to receive job alerts, the latest news, and insider tips tailored to your preferences.

Explore the exciting and rewarding career opportunities that await at Amazon. Amazon is more than just a company; it’s a platform for building a promising future. Whether you’re starting or looking to advance your career, Amazon offers the resources, support, and network you need to succeed. Join us, and be a part of our continuing mission to be Earth's most customer-centric company.

Learn more about Amazon

Size: 1,608 employees
Market Cap: $832.6 billion
Industry
Net Income: $21.3 billion
Founded: 1994
5 Year Trend: +28.1%
Revenue: $386 billion
NASDAQ
