Machine Learning Infrastructure Engineer

Astera Labs • $140K — $165K *

San Jose, CA 95123 • In-Person

Information Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

1-5 years of experience in software engineering, ML infrastructure, MLOps, or related backend roles.
Strong proficiency in Python and solid systems instincts.
Experience with AWS or GCP in managing production services.
Familiarity with inference deployments and model APIs for LLM/ML workloads.
Knowledge of observability, telemetry, and reliability engineering practices.
Understanding of evaluation systems and release workflows for AI applications.
Ability to translate complex infrastructure needs into scalable solutions.

Responsibilities

Build and enhance internal AI infrastructure for LLM applications and engineering workflows.
Oversee inference deployment paths, including access control and operational reliability.
Create platform layers such as model gateways and runtime integrations for safe scaling.
Develop AI Ops capabilities, including observability and incident response.
Establish monitoring dashboards and logging for production AI systems.
Optimize performance and cost through routing and caching strategies.
Design reusable APIs and SDKs that simplify AI system deployment and governance.

Full Job Description

Machine Learning Infrastructure Engineer

Location: San Jose, CA
Experience: 1-5 years
Team: Applied AI

The role

We're hiring a Machine Learning Infrastructure Engineer to build the runtime, platform, and operational backbone for modern AI systems. This role is for someone who wants to work on the systems behind the systems: model access layers, routing, serving paths, telemetry, observability, evaluation infrastructure, and the controls needed to make fast-moving AI work reliable in practice.

This is a platform role, but not in the old sense. The work is tightly coupled to how modern AI systems are actually built and used: multiple model providers, agent runtimes, skill and tool layers, inference telemetry, cost-aware routing, AI spend visibility, and governance that is strong enough for real internal adoption.

What you'll do

Build and improve internal AI infrastructure for LLM applications, agents, retrieval systems, and model-backed engineering workflows.
Own inference deployment paths across managed and self-serve environments, including access control, monitoring, and operational reliability.
Build platform layers such as model gateways, routing, runtime integrations, telemetry, and controls for safe execution at scale.
Develop AI Ops capabilities across evaluation, release readiness, observability, incident triage, regression detection, and cost monitoring.
Build dashboards, tracing, logging, and alerting for production AI systems, including spend and usage visibility across tools and teams.
Improve performance and unit economics through routing, caching, batching, failover, and latency/cost optimization.
Create reusable APIs, SDKs, and platform abstractions that make AI systems easier to deploy, evaluate, govern, and operate.

What we're looking for

1-5 years of experience in software engineering, ML infrastructure, MLOps, platform engineering, or related backend/infrastructure roles.
Strong Python plus strong systems instincts.
Experience with AWS or GCP and real production service ownership.
Familiarity with inference deployments, model APIs, gateways, serving systems, or runtime infrastructure for LLM/ML workloads.
Experience with observability, telemetry, reliability engineering, and incident response.
Understanding of eval systems, release workflows, retrieval-backed systems, and debugging non-deterministic AI behavior.
Ability to translate messy platform needs into scalable internal infrastructure.

What strong candidates often look like

They have built or operated systems where latency, routing, cost, telemetry, and reliability actually matter. They understand that modern AI infrastructure is not just about getting a model endpoint running. It is about building the runtime, visibility, controls, and developer experience that let an applied AI team move fast without losing quality or trust.

Why this role is interesting

The team is building AI-ready infrastructure in the most literal sense: observability, access control, AI spend tracking, secure managed platforms, skill/tool infrastructure, and telemetry that spans requests, tools, models, and outcomes. If you want to work on the platform layer that makes modern agentic systems possible, and do it in a setting where the downstream users are serious engineers with high expectations, this is that role.

The base pay range for this role is $140,000 to $165,000.

About Astera Labs

Astera Labs is a semiconductor company that designs and develops purpose-built connectivity solutions for data-centric systems. The company's portfolio of products includes system-aware semiconductor integrated circuits (ICs), boards, and intellectual property (IP) that are used in data center servers, storage, and networking equipment. Astera Labs' products are designed to improve the performance, latency, and power consumption of data-centric systems. The company was founded in 2018 and is headquartered in Santa Clara, California.

Size: 51 employees
Industry: Manufacturing & Automotive
Net Income: -$3 million
Founded: 2018
Revenue: $5 million
NASDAQ: ALAB

* Ladders Estimates
