Gem.com

Software Engineer, ML Platform

$187K - $395K *
Information Technology
5 - 7 years of experience

Job Overview by Ladders

Qualifications

  • 5+ years of engineering experience with a strong background in Python and distributed systems architecture.
  • Extensive experience building and managing large-scale systems focused on queues and scheduling.
  • Deep expertise in Linux, Docker, and Kubernetes as core infrastructure tools.
  • Proficient with Redis and S3-compatible storage, as well as public cloud platforms, particularly AWS.

Responsibilities

  • Architect end-to-end model serving pipelines integrating new architectures into a high-throughput inference engine.
  • Build robust scheduling systems for optimal job management across GPU clusters.
  • Design traffic-based systems for hotswapping models on GPU workers to maximize resource efficiency.
  • Own end-to-end CI/CD pipelines and manage resilient artifact stores for model checkpoints.
  • Develop user-friendly APIs that enable rapid feature delivery for product and research teams.
  • Manage and optimize inference workloads at scale across multiple clusters.

Benefits

  • Opportunity to work on foundational infrastructure impacting millions of users.
  • Massive ownership and influence in a 0-to-1 environment, not just maintenance.
  • Ability to collaborate closely with research teams on cutting-edge AI models.

Full Job Description

Where You Come In

This is a rare opportunity to build the foundational infrastructure that powers our large-scale multimodal models. We believe that reliable, high-performance infrastructure is the single biggest differentiating factor between success and failure in achieving our mission. You will be a foundational member of the team, designing the critical systems that allow us to train and serve next-generation AI to millions of users.

What You'll Do

This is a 0-to-1 opportunity, not a maintenance role. You will have massive ownership of the following:

  • Architect end-to-end model serving pipelines and integrate new model architectures from our research team into our core, high-throughput inference engine.
  • Build robust and sophisticated scheduling systems to manage jobs based on cluster availability and user priority, ensuring we optimally leverage thousands of expensive GPU resources.
  • Design and implement dynamic, traffic-based systems for hotswapping models on our GPU workers to maximize fleet efficiency and meet product SLOs.
  • Own the end-to-end CI/CD pipelines, including creating a resilient artifact store to manage all model checkpoints across multiple versions and providers.
  • Develop and maintain user-friendly APIs and interaction patterns that empower our product and research teams to ship groundbreaking features at high velocity.
  • Manage and optimize our complex inference workloads at scale, operating across multiple clusters and hardware providers.


Who You Are

We are looking for a world-class builder with a proven history of creating and managing large-scale, high-performance systems. The following requirements are non-negotiable:

  • 5+ years of professional engineering experience with deep, hands-on proficiency in Python and complex distributed systems architecture.
  • Extensive, practical experience building and managing systems at scale, specifically with queues, scheduling, traffic-control, and fleet management.
  • Deep expertise in our core infrastructure stack: Linux, Docker, and Kubernetes.
  • Strong experience with Redis, S3-compatible storage, and public cloud platforms (AWS).


What Sets You Apart (Bonus Points)

You'll stand out as an exceptional candidate if you also bring:

  • Experience with high-performance, large-scale ML systems (managing >100 GPUs).
  • Deep familiarity with PyTorch and CUDA.
  • Experience with modern high-performance networking and interconnects, including RDMA (RoCE, InfiniBand) and NVLink.
  • Familiarity with FFmpeg and multimedia processing pipelines.


Compensation

The base pay range for this role is $187,500 - $395,000 per year.

About Gem.com

Founded: 2013
