MLE (General Training Infrastructure)

NOUS RESEARCH

$130K — $180K *
Enterprise Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX proficiency for developing efficient training code
  • Experience with multi-node, multi-GPU setups and debugging
  • Familiarity with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques

Responsibilities

  • Optimize the performance of training infrastructure for large language models
  • Implement parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profile distributed training processes and eliminate performance bottlenecks
  • Develop robust fault-tolerant training systems with recovery mechanisms

Benefits

  • Opportunity to work on cutting-edge technology in AI and machine learning
  • Collaborative environment with a focus on innovation and continuous improvement
  • Access to resources for professional development and upskilling
  • Flexible work arrangements to support work-life balance
  • Potential to contribute to impactful open-source projects

Full Job Description

We're looking for an MLE to scale training of large transformer-based models. You'll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU and multi-node training environments.

Responsibilities:
  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery mechanisms

Qualifications:
  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX skills for writing performant, fault-tolerant training code
  • Multi-node, multi-GPU training experience with debugging skills
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques

Preferred:
  • Experience with distributed training of large multi-modal models, including those with separate vision encoders
  • Deep knowledge of NCCL (e.g. symmetric memory)
  • Experience with mixture of experts architectures and expert parallelism
  • Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar)
  • Custom CUDA kernel development for training operations
  • Proven ability to debug training instability and numerical issues
  • Experience designing test runs to de-risk large-scale optimizations
  • Hands-on experience with FP8 or FP4 training
  • Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
