Orion Innovation

Senior ML Infrastructure Engineer

Orion Innovation | $120K - $160K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience with Kubernetes and Azure Kubernetes Service (AKS)
  • Proficient in configuring and managing GPU node pools and CUDA
  • Strong background in Python for production-level applications
  • Familiarity with Hugging Face Transformers and state-of-the-art NLP models
  • Experience with Infrastructure as Code (IaC) using Terraform or Bicep
  • Ability to work with Azure SDK and manage Azure infrastructure
  • Willingness to learn new technologies is a plus

Responsibilities

  • Configure and manage GPU clusters for optimal performance
  • Design and implement Kubernetes jobs and autoscaling policies
  • Debug and resolve CUDA runtime issues for machine learning models
  • Oversee the end-to-end model serving process on Kubernetes
  • Integrate with Azure services like Data Lake Storage and OpenAI
  • Implement memory management strategies for efficient resource usage
  • Conduct performance profiling to enhance system capabilities

Benefits

  • Opportunity to work with cutting-edge technology in a dynamic environment
  • Flexible working hours to promote work-life balance
  • Access to professional development resources to enhance skills
  • Collaborative team culture encouraging innovation and creativity
  • Exposure to large-scale project management in a cloud environment
Full Job Description
Project Overview:

We're building a large-scale document intelligence platform that processes text files up to 5 TB in size, extracts insights using BERT-class NLP models, and surfaces answers to analysts via a low-latency query interface. The platform runs on Azure Kubernetes Service (AKS) with dedicated GPU node pools, uses KEDA for event-driven autoscaling, and integrates with Azure Data Lake Storage Gen2 and Azure OpenAI.

This is a hands-on role at the intersection of platform engineering and applied ML. It requires someone who is equally comfortable debugging a CUDA out-of-memory error and designing a Kubernetes autoscaling policy. As the Senior ML Infrastructure Engineer, you will own the end-to-end infrastructure layer, from GPU cluster configuration and CUDA runtime management to Kubernetes job orchestration and model serving.

Skill / Technology (Level):
  • Kubernetes / AKS (Expert): Multi-node-pool design, taint/toleration, autoscaler
  • GPU node pools (NC/ND series) (Senior): Device plugin, driver compat, resource limits
  • KEDA (Senior): ScaledJob, queue triggers, cooldown tuning
  • CUDA / cuDNN (Mid-Senior): Runtime config via PyTorch; raw kernel dev not required
  • PyTorch (GPU inference) (Senior): Batching, FP16, memory management, profiling
  • Hugging Face Transformers (Senior): BERT/DistilBERT/BGE loading, pipeline API, tokenization
  • Python (production) (Senior): Async workers, Azure SDK, queue consumers
  • Azure infrastructure (Senior): VNet, private endpoints, Key Vault, ADLS, AD
  • Docker / Helm (Senior): Multi-stage builds, Helm chart authoring
  • IaC (Terraform / Bicep) (Preferred): willingness to learn is acceptable


About Orion Innovation

Orion Innovation is a global technology services firm that provides IT solutions and services to businesses across various industries. The company offers digital transformation, product engineering, data analytics, cloud computing, and other IT services to help businesses improve their operations and achieve their goals. Orion Innovation has offices in the United States, Europe, and Asia, and serves clients in industries such as financial services, healthcare, retail, and more.
Size: 4,000 employees