MLE (General Training Infrastructure)

NOUS RESEARCH

• $130K — $180K *

New York, NY 10025 (In-Person)

Enterprise Technology

Less than 5 years of experience

More than 3 months ago

Job Overview by Ladders

Qualifications

3+ years training large neural networks in production
Expert-level PyTorch or JAX proficiency for developing efficient training code
Experience running and debugging multi-node, multi-GPU training
Familiarity with distributed training frameworks and cluster management
Deep understanding of GPU memory management and optimization techniques

Responsibilities

Optimize the performance of training infrastructure for large language models
Implement parallelization strategies across data, tensor, pipeline, and context dimensions
Profile distributed training processes and eliminate performance bottlenecks
Develop robust fault-tolerant training systems with recovery mechanisms

Benefits

Opportunity to work on cutting-edge technology in AI and machine learning
Collaborative environment with a focus on innovation and continuous improvement
Access to resources for professional development and upskilling
Flexible work arrangements to support work-life balance
Potential to contribute to impactful open-source projects

Full Job Description

We're looking for a Machine Learning Engineer (MLE) to scale the training of large transformer-based models. You'll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance in multi-GPU, multi-node environments.

Responsibilities:

Performance engineering of training infrastructure for large language models
Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
Profiling distributed training runs and optimizing performance bottlenecks
Building fault-tolerant training systems with checkpointing and recovery mechanisms

Qualifications:

3+ years training large neural networks in production
Expert-level PyTorch or JAX for performant and fault-tolerant training code
Multi-node, multi-GPU training experience with debugging skills
Experience with distributed training frameworks and cluster management
Deep understanding of GPU memory management and optimization techniques

Preferred:

Experience with distributed training of large multi-modal models, including those with separate vision encoders
Deep knowledge of NCCL (e.g. symmetric memory)
Experience with mixture of experts architectures and expert parallelism
Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar)
Custom CUDA kernel development for training operations
Proven ability to debug training instability and numerical issues
Experience designing test runs to de-risk large-scale optimizations
Hands-on experience with FP8 or FP4 training
Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)

* Ladders Estimates
