Gem.com

Software Engineer, ML Platform

$187K - $395K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of professional engineering experience in large-scale systems
  • Hands-on proficiency in Python and complex distributed systems architecture
  • Experience building and managing systems with queues and scheduling
  • Expertise in Linux, Docker, and Kubernetes
  • Strong experience with Redis, S3-compatible storage, and AWS

Responsibilities

  • Architect end-to-end model serving pipelines and integrate new architectures
  • Build sophisticated scheduling systems to manage GPU resources efficiently
  • Design dynamic systems for hotswapping models to maximize efficiency
  • Own end-to-end CI/CD pipelines and manage model checkpoints
  • Develop user-friendly APIs to support product and research teams
  • Manage and optimize inference workloads across multiple clusters

Benefits

  • Opportunity to work on foundational infrastructure for next-generation AI
  • Massive ownership in a 0-to-1 development environment
  • Engagement with cutting-edge technologies
  • Collaborative culture with high-impact teammates
  • Exposure to large-scale AI model serving

Full Job Description

Where You Come In

This is a rare opportunity to build the foundational infrastructure that powers our large-scale multimodal models. We believe that reliable, high-performance infrastructure is the single biggest differentiating factor between success and failure in achieving our mission. You will be a foundational member of the team, designing the critical systems that allow us to train and serve next-generation AI to millions of users.

What You'll Do

This is a 0-to-1 opportunity, not a maintenance role. You will have massive ownership to:

  • Architect end-to-end model serving pipelines and integrate new model architectures from our research team into our core, high-throughput inference engine.
  • Build robust and sophisticated scheduling systems to manage jobs based on cluster availability and user priority, ensuring we optimally leverage thousands of expensive GPU resources.
  • Design and implement dynamic, traffic-based systems for hotswapping models on our GPU workers to maximize fleet efficiency and meet product SLOs.
  • Own the end-to-end CI/CD pipelines, including creating a resilient artifact store to manage all model checkpoints across multiple versions and providers.
  • Develop and maintain user-friendly APIs and interaction patterns that empower our product and research teams to ship groundbreaking features at high velocity.
  • Manage and optimize our complex inference workloads at scale, operating across multiple clusters and hardware providers.


Who You Are

We are looking for a world-class builder with a proven history of creating and managing large-scale, high-performance systems. The following requirements are non-negotiable:

  • 5+ years of professional engineering experience with deep, hands-on proficiency in Python and complex distributed systems architecture.
  • Extensive, practical experience building and managing systems at scale, specifically with queues, scheduling, traffic-control, and fleet management.
  • Deep expertise in our core infrastructure stack: Linux, Docker, and Kubernetes.
  • Strong experience with Redis, S3-compatible storage, and public cloud platforms (AWS).


What Sets You Apart (Bonus Points)

You'll stand out as an exceptional candidate if you also bring:

  • Experience with high-performance, large-scale ML systems (managing >100 GPUs).
  • Deep familiarity with PyTorch and CUDA.
  • Experience with modern high-performance networking and interconnects, including RDMA (RoCE, InfiniBand) and NVLink.
  • Familiarity with FFmpeg and multimedia processing pipelines.


Compensation

The base pay range for this role is $187,500 - $395,000 per year.

About Gem.com

Founded: 2013
