Senior ML Infrastructure Engineer

Orion Innovation • $120K — $160K *

Edison, NJ 08817 • In-Person

Information Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

5+ years of experience with Kubernetes and Azure Kubernetes Service (AKS)
Proficient in configuring and managing GPU node pools and CUDA
Strong background in Python for production-level applications
Familiarity with Hugging Face Transformers and state-of-the-art NLP models
Experience with Infrastructure as Code (IaC) using Terraform or Bicep
Ability to work with Azure SDK and manage Azure infrastructure
Willingness to learn new technologies is a plus

Responsibilities

Configure and manage GPU clusters for optimal performance
Design and implement Kubernetes jobs and autoscaling policies
Debug and resolve CUDA runtime issues for machine learning models
Oversee the end-to-end model serving process on Kubernetes
Integrate with Azure services like Data Lake Storage and OpenAI
Implement memory management strategies for efficient resource usage
Conduct performance profiling to enhance system capabilities

Benefits

Opportunity to work with cutting-edge technology in a dynamic environment
Flexible working hours to promote work-life balance
Access to professional development resources to enhance skills
Collaborative team culture encouraging innovation and creativity
Exposure to large-scale project management in a cloud environment

Full Job Description

Project Overview:

We're building a large-scale document intelligence platform that processes text files up to 5 TB in size, extracts insights using BERT-class NLP models, and surfaces answers to analysts via a low-latency query interface. The platform runs on Azure Kubernetes Service (AKS) with dedicated GPU node pools, uses KEDA for event-driven autoscaling, and integrates with Azure Data Lake Storage Gen2 and Azure OpenAI.
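To illustrate the scale constraint above: a text file of several terabytes cannot be loaded into memory at once, so a worker would typically stream it in bounded pieces. The following is a minimal sketch in plain Python, not the platform's actual code. The function name `iter_text_chunks` and the 64 MiB default are hypothetical choices made for the example, and the stream could be any binary file-like object (for instance a local file or a download stream).

```python
import io
from typing import BinaryIO, Iterator


def iter_text_chunks(stream: BinaryIO, chunk_bytes: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield newline-aligned chunks so no line is split across two chunks.

    Hypothetical helper for illustration only. Each yielded chunk ends with a
    newline, except possibly the last one.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    carry = b""  # trailing partial line left over from the previous read
    while True:
        block = stream.read(chunk_bytes)
        if not block:
            break
        data = carry + block
        cut = data.rfind(b"\n")
        if cut == -1:
            # No newline seen yet: keep accumulating until one appears.
            carry = data
            continue
        yield data[: cut + 1]
        carry = data[cut + 1 :]
    if carry:
        # Final line without a trailing newline.
        yield carry


if __name__ == "__main__":
    demo = io.BytesIO(b"first line\nsecond line\nthird")
    for chunk in iter_text_chunks(demo, chunk_bytes=8):
        print(chunk)
```

Note that a single line longer than available memory would still accumulate in `carry`; a production worker would need an upper bound on line length or a different splitting rule.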

This is a hands-on role at the intersection of platform engineering and applied ML. It requires someone who is equally comfortable debugging a CUDA out-of-memory error and designing a Kubernetes autoscaling policy. As the Senior ML Infrastructure Engineer, you will own the end-to-end infrastructure layer, from GPU cluster configuration and CUDA runtime management to Kubernetes job orchestration and model serving.
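As one example of the out-of-memory handling mentioned above, a common mitigation is to halve the batch size and retry when a batch does not fit. The sketch below is framework-agnostic and purely illustrative: `run_with_batch_backoff`, `infer`, and `oom_errors` are hypothetical names, and `MemoryError` stands in for the GPU error type. In a PyTorch setup one would presumably pass `torch.cuda.OutOfMemoryError` instead, which is an assumption about the environment rather than something the posting specifies.

```python
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_batch_backoff(
    items: Sequence[T],
    infer: Callable[[Sequence[T]], List[R]],
    batch_size: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> List[R]:
    """Run `infer` over `items` in batches, halving the batch size on OOM.

    Hypothetical helper for illustration. `infer` is any callable that maps a
    batch of inputs to a list of outputs of the same length.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start : start + batch_size]
        try:
            results.extend(infer(batch))
        except oom_errors:
            if batch_size == 1:
                # Even a single item does not fit: give up and surface the error.
                raise
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results


if __name__ == "__main__":
    def fake_infer(batch: Sequence[int]) -> List[int]:
        # Simulated model that "runs out of memory" above two items per batch.
        if len(batch) > 2:
            raise MemoryError("simulated OOM")
        return [x * 2 for x in batch]

    print(run_with_batch_backoff(list(range(5)), fake_infer, batch_size=8))
```

The reduced batch size is kept for the rest of the run rather than reset, which trades some throughput for fewer repeated failures.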

Skill / Technology - Level: Details

Kubernetes / AKS - Expert: Multi-node-pool design, taint/toleration, autoscaler
GPU node pools (NC/ND series) - Senior: Device plugin, driver compatibility, resource limits
KEDA - Senior: ScaledJob, queue triggers, cooldown tuning
CUDA / cuDNN - Mid-Senior: Runtime config via PyTorch; raw kernel dev not required
PyTorch (GPU inference) - Senior: Batching, FP16, memory management, profiling
Hugging Face Transformers - Senior: BERT/DistilBERT/BGE loading, pipeline API, tokenization
Python (production) - Senior: Async workers, Azure SDK, queue consumers
Azure infrastructure - Senior: VNet, private endpoints, Key Vault, ADLS, AD
Docker / Helm - Senior: Multi-stage builds, Helm chart authoring
IaC (Terraform / Bicep) - Preferred: willingness to learn is acceptable

About Orion Innovation

Orion Innovation is a global technology services firm that provides IT solutions and services to businesses across various industries. The company offers digital transformation, product engineering, data analytics, cloud computing, and other IT services to help businesses improve their operations and achieve their goals. Orion Innovation has offices in the United States, Europe, and Asia, and serves clients in industries such as financial services, healthcare, retail, and more.

Size

4,000 employees

Industry

Information Technology

* Ladders Estimates
