Quantiphi

Technical Architect - Machine Learning

Quantiphi
$120K - $160K *
US-Anywhere
Remote in United States
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 6-8 years of hands-on experience in machine learning and AI engineering with a strong production background
  • Proven expertise in building multi-agent systems and agentic workflows, preferably with LangGraph or CrewAI
  • Expert-level Python proficiency alongside experience with ML frameworks like TensorFlow and PyTorch
  • Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB) and scalable RAG systems
  • Production-level experience with major cloud platforms (AWS, GCP, or Azure) and related services

Responsibilities

  • Architect and build autonomous multi-agent systems from scratch
  • Engineer advanced capabilities for agents and develop custom tools for complex tasks
  • Implement context engineering to ensure agents maintain learning and state
  • Own the deployment and maintenance of agentic systems on cloud platforms
  • Integrate and optimize LLMs for enhanced performance of autonomous agents
  • Create and manage comprehensive tool libraries for agent interactions
  • Implement monitoring systems to evaluate agent behavior and system performance

Benefits

  • Collaborative work environment with enthusiastic team members
  • Opportunities for professional growth and mentorship
  • Flexible work location within the US or Canada
  • Engagement in cutting-edge AI research and application
  • Impactful work in shaping the future of autonomous systems
Full Job Description
Role: Technical Architect Machine Learning Engineer - Agentic AI & Multi-Agent Systems
Experience Level: 8-12 years
Location: US / Canada

Job Summary:

We are seeking an experienced Senior Machine Learning Engineer to architect, build, and deploy production-grade agentic AI systems and multi-agent workflows from the ground up. The ideal candidate will have deep expertise in designing autonomous AI systems that can collaborate, reason, and execute complex tasks with minimal human intervention. You will be responsible for creating scalable, robust agentic workflows using frameworks such as CrewAI and LangGraph, while ensuring enterprise-grade deployment on major cloud platforms.

Roles & Responsibilities:

Agentic System Architecture & Development:
  • Architect & Build Agentic Systems: Design and develop end-to-end multi-agent systems from scratch. You will create the foundational agent harnesses, define communication protocols, and build orchestration layers using frameworks like CrewAI, LangGraph, and AutoGen. Your architectural decisions should ensure:
    • Hierarchical and collaborative multi-agent structures with well-defined agent roles, responsibilities, and communication protocols
    • Dynamic task decomposition, sophisticated tool integration, planning mechanisms (ReAct), and self-correction loops
    • State management systems and memory mechanisms for persistent agent interactions
  • Engineer Advanced Agent Capabilities: Develop custom agent-tools and define specialized agent-skills that empower agents to perform complex, domain-specific tasks.
  • Pioneer Context Engineering: Implement advanced context engineering and memory systems to ensure agents maintain state, learn from interactions, and make informed decisions in dynamic environments.
  • Deploy Production-Grade Solutions: Own the deployment, scaling, and maintenance of robust, low-latency agentic systems on major cloud platforms (GCP, AWS, or Azure). You will implement best-in-class MLOps practices for monitoring, continuous integration/continuous deployment (CI/CD), and system reliability.
  • Integrate and Optimize LLMs: Integrate LLMs to serve as the core reasoning engines for autonomous agents. You will apply advanced techniques like RAG and PEFT to optimize performance.

Tool Development & RAG Integration:
  • Create and maintain comprehensive tool libraries for agents including API integrations, database queries, and external service connections
  • Design and implement RAG systems using vector databases (Pinecone, Weaviate, ChromaDB)
  • Develop custom tools and plugins that enable agents to interact with various enterprise systems and APIs
  • Ensure tool reliability, error handling, and seamless integration within agentic workflows

Observability, Monitoring & Evaluation:
  • Implement comprehensive monitoring and tracing systems for agent behavior, performance, cost optimization, and latency analysis
  • Design novel evaluation frameworks to assess multi-step agentic task success, reliability, and accuracy
  • Use advanced observability tools (LangSmith, Arize AI, or custom solutions) to trace agent decision-making processes
  • Establish metrics and KPIs for measuring agentic system performance in production environments

Required Skills & Qualifications:

Experience:
  • 6-8 years of hands-on experience in machine learning and AI engineering with a proven track record of taking ML systems to production
  • Demonstrated expertise in building multi-agent systems and agentic workflows, preferably with LangGraph or CrewAI

Technical Skills - Must Have:
  • Programming & ML: Expert-level Python proficiency with ML frameworks (TensorFlow, PyTorch, Transformers). Experience with FastAPI, async programming, and microservices architecture
  • Data & Vector Systems: Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB) and building scalable RAG systems
  • Monitoring & Observability: Experience with LLM application monitoring tools (LangSmith, Weights & Biases, custom telemetry solutions)
  • Proven ability to architect and implement complex AI systems from scratch in production environments
  • Cloud Platform Expertise: Production-level experience with at least one major cloud platform (AWS, GCP, or Azure), including:
    • Compute services (EC2, GCE, Azure VMs)
    • Serverless functions (Lambda, Cloud Functions, Azure Functions)
    • Container orchestration (EKS, GKE, AKS)
    • Managed AI/ML services (SageMaker, Vertex AI, Azure ML)
  • Production & DevOps: Strong skills in Infrastructure as Code (Terraform, CloudFormation), CI/CD pipelines (GitHub Actions, Jenkins), and containerization (Docker, Kubernetes)

Technical Skills - Good to have:
  • Experience with prompt engineering techniques, fine-tuning SLMs (PEFT, SFT, RLHF), and model optimization
  • Knowledge of distributed systems, message queues, and event-driven architectures for agent coordination
  • Familiarity with SDLC best practices, version control (Git), and agile development methodologies
  • Experience with tool-calling agents, multi-step workflows, and stateful orchestration (e.g. graphs, planners, routers).
  • Hands-on evals for agents: trajectory / tool-use checks, golden traces, LLM-as-judge with fixed rubrics, regression suites.
  • Online evals, drift thinking, and clear quality gates before or after deploy (thresholds, alerts, rollback criteria).
  • Safety and abuse: prompt injection via tools, untrusted retrieval, PII handling in prompts and logs, allowlists and guardrails.
  • Cost and latency discipline: budgets per run, timeouts, caps on turns and tool calls.
  • Model lifecycle: routing / gateway patterns, version pinning, fallbacks, and which model for which step.
  • Memory and state: what is persisted, retention, redaction, and what must never be stored.

Soft Skills:
  • Exceptional problem-solving and analytical thinking with ability to tackle complex, ambiguous challenges
  • Strong communication skills to explain complex agentic concepts to both technical and non-technical stakeholders
  • Proven ability to work independently and drive large-scale projects to completion with minimal supervision
  • Leadership mindset with experience mentoring team members and driving technical excellence


If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

About Quantiphi

Quantiphi is an artificial intelligence and machine learning services company that helps businesses transform their operations through the use of AI. The company provides a range of services, including data engineering, machine learning, computer vision, natural language processing, and predictive analytics. Quantiphi was founded in 2013 and is headquartered in King of Prussia, Pennsylvania.
Size: 500 employees
Founded: 2013
