We're looking for a Research Engineer to build and maintain the training platform that powers our model development: experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.
What You'll Do
- Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters
- Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management
- Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection
- Optimize and automate the research iteration loop from experiment launch to results analysis
- Manage job scheduling and cluster utilization for efficient use of GPU compute
- Build internal tooling and interfaces that help researchers move faster
- Collaborate with training systems, data infrastructure, and research teams to support their platform needs
What We're Looking For
- Strong software engineering skills with experience in MLOps or ML platform engineering
- Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)
- Experience building experiment tracking, reproducibility, and artifact management systems
- Comfort managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)
- Strong reliability engineering instincts: monitoring, alerting, and failure recovery
Nice to Have (But Not Required)
- Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)
- Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)
- Experience supporting large model training pipelines (LLMs, VLMs, or video models)
- Understanding of parallelism strategies and how they affect training efficiency and debugging
- Experience with cloud-based training infrastructure (AWS, GCP, or Azure)
Why This Role
- Your platform is the daily tool every researcher and engineer uses to train models
- Improvements to training velocity and reliability compound across every experiment the team runs
- High visibility with direct feedback from researchers and ML engineers
- Build systems that scale from today's models to future frontier training runs