Snowflake Computing

Senior Software Engineer - LLM Post-Training Platform

Snowflake Computing, $130K - $180K *
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in building and shipping production ML systems
  • Strong foundation in distributed systems and infrastructure, especially with Kubernetes
  • Familiarity with GPU and LLM infrastructure including PyTorch and DeepSpeed
  • Proven track record of enhancing system reliability and efficiency
  • Bachelor's degree in Computer Science or related field, with advanced degrees preferred
  • Hands-on experience with LLM post-training is a plus

Responsibilities

  • Design and build across the full stack, from the public training APIs and SDK to the GPU data plane
  • Scale serverless GPU compute with multi-tenant scheduling and fault tolerance
  • Enhance performance at scale to maintain responsiveness under load
  • Collaborate with Snowflake Research to productionize cutting-edge ML techniques
  • Ensure GPUs are efficiently used during training and inference processes

Benefits

  • Opportunities for rapid career growth in a fast-paced environment
  • Collaborative team culture emphasizing innovation
  • A chance to work on cutting-edge ML technology for large-scale applications
  • Mentorship and support from experts in the field
  • Exposure to advanced ML and AI research initiatives

Full Job Description

Senior Software Engineer - LLM Post-Training Platform

The Snowflake ML Platform team's mission is to let customers run their most demanding ML/AI workloads inside Snowflake. Cortex Training is our LLM post-training platform: it turns scarce, expensive GPU capacity into a simple, composable service, so customers can adapt open-weight foundation models to their own business problems while we handle the hard distributed-systems parts, including scheduling, orchestration, multi-node training and inference, fault tolerance, and throughput.

The platform already runs post-training at scale. Under the hood, it decouples GPU computation from the training loop and exposes it as primitive APIs that compose into everything from SFT to full RL workflows. You'll work alongside a team that ships fast and sweats reliability, as well as the researchers behind DeepSpeed. We're looking for an engineer who thrives in the ML infrastructure layer and brings a solid understanding of LLMs and post-training to help us scale and grow the platform.

YOU WILL:
  • Design and build across the full stack - from the public training APIs and SDK through the control plane to the GPU data plane.
  • Scale the distributed systems that make GPU compute serverless - multi-tenant scheduling, placement, and capacity-aware routing across regional GPU pools, with fault tolerance built in.
  • Drive end-to-end performance at scale - keep the training, inference, and RL loops fast and the data plane responsive under heavy concurrent load, with GPUs kept saturated.
  • Productionize research building blocks - partner with Snowflake Research to turn state-of-the-art training and inference techniques into reliable, composable components customers can run at enterprise scale.

QUALIFICATIONS:
  • 5+ years building and shipping production ML systems.
  • Strong distributed systems and infrastructure foundation - designing scalable, fault-tolerant services and operating them on Kubernetes in production.
  • Familiarity with GPU and LLM infrastructure - e.g., PyTorch, DeepSpeed/FSDP, Ray, CUDA/NCCL, vLLM; able to debug across the data, infrastructure, and GPU layers.
  • Demonstrated ability to harden complex systems for reliability, throughput, and cost efficiency.
  • BS in Computer Science or a related field (MS/PhD a plus).
  • (Bonus) Hands-on LLM post-training / modeling experience - the strongest candidates pair deep infra skills with real post-training intuition.

Snowflake is growing fast, and we're scaling our team to help enable and accelerate our growth. We are looking for people who share our values, challenge ordinary thinking, and push the pace of innovation while building a future for themselves and Snowflake.

How do you want to make your impact?

For jobs located in the United States, please visit the job posting on the Snowflake Careers Site for salary and benefits information: careers.snowflake.com

About Snowflake Computing

Snowflake is a cloud-based data-warehousing company that was founded in 2012. The company provides a data platform that allows customers to store and analyze data using cloud-based infrastructure. Snowflake's platform is designed to be highly scalable and flexible, allowing customers to easily add or remove computing resources as needed. The company's customers include a wide range of businesses, from startups to Fortune 500 companies. Snowflake has received significant funding from investors and has been recognized as one of the fastest-growing companies in the United States.
Size: 2,037 employees
Market Cap: $44.9 billion
Industry:
Net Income: -$539.1 million
Founded: 2012
Revenue: $592 million
NASDAQ
