Title: Backend ML Engineer
Reports to: Senior Software Architect
Location: North Sioux City, SD
Job Description: We are looking for a Backend ML Engineer who is interested in taking AI/ML systems from prototype to production, designing inference APIs, building retrieval and orchestration pipelines, integrating large language models, and operating ML infrastructure at scale. If you thrive in a collaborative, client-focused environment and enjoy shipping AI features that real users depend on, we'd love to have you on our team.
Required Technical Skills:
- 3-5 years of experience in backend or ML engineering
- Strong working knowledge of Python, including FastAPI or Flask
- Experience with modern ML libraries such as PyTorch, Hugging Face Transformers, and sentence-transformers
- Proficiency with cloud platforms including AWS, GCP, or Azure
- Hands-on experience integrating LLMs (OpenAI, Anthropic, Gemini, or open-source models) into production systems
- Familiarity with vector databases such as Weaviate, pgvector, Pinecone, or similar
- Experience with retrieval-augmented generation (RAG) patterns
- Self-motivated with a positive and professional attitude
- Knowledge of additional languages or runtimes, such as JavaScript and Node.js, is a plus
Required Education/Experience:
- Bachelor's degree in Computer Science, Machine Learning, or a related field, or equivalent practical experience (minimum requirement)
- Graduate-level coursework or specialization in ML/AI is a plus
- Relevant cloud certifications are a plus
- Demonstrated experience shipping ML systems to production is a plus
- US DoD clearance preferred, or willingness to obtain one
Qualifications:
- Strong experience building backend services with Python (FastAPI/Flask); comfort working with async APIs and request/response patterns for ML inference workloads.
- Hands-on experience integrating LLMs and embedding models into production applications, including prompt engineering, context management, and handling rate limits, retries, and streaming responses.
- Familiarity with RAG architectures: chunking strategies, embedding pipelines, vector search, reranking, and evaluation metrics (Recall[redacted], MRR, faithfulness, answer relevance).
- Experience with vector databases (Weaviate, pgvector, Pinecone, Qdrant, or similar) and traditional databases (PostgreSQL, MariaDB) for hybrid retrieval and metadata filtering.
- Cloud experience (AWS/GCP/Azure) for deploying ML services - including managed inference endpoints, GPU instances, or serverless model hosting.
- Strong understanding of API authentication, secure handling of model inputs/outputs, and PII/PHI-aware design where applicable.
- Experience with ML observability: tracking latency, token usage, cost-per-query, retrieval quality, and model drift in production.
- Background in data pipelines, document ingestion/parsing, or evaluation frameworks (Ragas, TruLens, Docling, custom harnesses) is needed.
- Familiarity with fine-tuning, LoRA/PEFT, or model distillation is appreciated.
- Experience with MLOps tooling (MLflow, Weights & Biases, Kubeflow) or LLM orchestration frameworks (LangChain, LlamaIndex, Haystack, or custom orchestrators) is a plus.
Responsibilities:
- Build, test, and maintain production ML services - inference APIs, retrieval pipelines, orchestration layers, and guardrail/evaluation components.
- Design scalable RESTful and streaming APIs that serve ML model outputs reliably under real-world load.
- Integrate and tune LLMs, embedding models, and rerankers; evaluate trade-offs across hosted (Anthropic, OpenAI, Vertex) and self-hosted (HF, vLLM) options on cost, latency, and quality.
- Build ingestion and chunking pipelines for unstructured data (PDFs, HTML, transcripts) and maintain vector store schemas for multi-tenant or multi-domain retrieval.
- Implement evaluation harnesses to measure retrieval quality, generation faithfulness, and end-to-end answer correctness; close the loop from evals back into pipeline improvements.
- Containerize and deploy ML workloads with Docker and Kubernetes; manage GPU/CPU resource allocation and model versioning.
- Optimize database queries, vector search performance, and caching strategies (including LLM prompt caching) to reduce latency and cost.
- Implement CI/CD pipelines for ML services and instrument monitoring for both system metrics (latency, error rate) and ML-specific metrics (retrieval quality, hallucination rate, drift).
- Collaborate with frontend engineers, ML researchers, and product analysts to translate model capabilities into shipped features.
- Document backend and ML infrastructure, including model cards, evaluation results, and architectural decisions.
- Travel: must be willing to travel 25% of the time, and periodically up to 50%.