Walmart

(USA) Distinguished, Software Engineer-AI/ML Engineer - Agentic Systems & Site Reliability Engineering

Walmart: $169K - $338K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's/Master's degree in Engineering, Computer Science, or related field
  • 12+ years of experience in Site Reliability Engineering, AI/ML Engineering, or Platform Engineering
  • Proven senior-level contributor in SRE or AI/ML
  • Expertise in mission-critical systems with a focus on KPIs like MTTD and MTTR
  • Familiarity with advanced machine learning algorithms and production deployment at scale

Responsibilities

  • Architect and develop advanced agentic AI systems for reliability engineering workflows
  • Design and implement multi-agent orchestration platforms for automated incident response
  • Build intelligent observability systems using ML-driven anomaly detection
  • Develop self-healing infrastructure platforms for proactive issue resolution
  • Collaborate with engineering teams to improve MTTD and MTTR through automation
  • Perform complex troubleshooting of distributed systems across Walmart’s technology stack
  • Drive the development of MLOps and AIOps platforms for continuous optimization

Benefits

  • Medical, vision, and dental coverage
  • 401(k) and stock purchase options
  • Paid time off including parental and family care leave
  • Short-term and long-term disability options
  • Educational benefits covering tuition and fees for associates

Full Job Description

Position Summary...
As a Distinguished AI/ML Engineer within Walmart Global Tech's Site Reliability Engineering organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart's entire technology ecosystem. You will architect and implement cutting-edge machine learning platforms and autonomous agents that revolutionize how we monitor, predict, and automatically resolve issues across all of Walmart's systems, supporting millions of Associates and customers globally.

Walmart Global Tech's Site Reliability Engineering organization is made up of hybrid systems and software engineers who take technical ownership of reliability, scalability, automation, and mission-critical issues related to the uptime, availability, and rapid improvement of Walmart's e-commerce, stores, and omni-channel platforms. As a technical expert in this domain, you'll drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems built on modern tech stacks with intelligent capacity management and predictive performance optimization.

You'll be responsible for designing and building Tier 0 high-availability, resilient agentic platforms that serve as the backbone for reliability engineering across all of Walmart's systems, stores, and facilities in US and international markets. You'll also define and implement unified, intelligent, operationally robust technical solutions and tools for all Walmart Technology organizations across all channels and geographies.

What you'll do...
AI/ML & Agentic Systems Technical Leadership:
  • Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self-optimization across all Walmart technology systems.
  • Design and implement multi-agent orchestration platforms that coordinate between different AI agents for automated incident response, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
  • Build intelligent observability and monitoring systems using ML-driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities that span all of Walmart's technology ecosystem.
  • Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system.
Site Reliability Engineering Technical Excellence:
  • Design, write, and build advanced tools to improve the reliability, latency, availability, and scalability of all Walmart Tech systems, including:
    • Engineering reliability and availability, starting with metrics and measurements across all domains
    • Enabling scaling by providing technical solutions, developing automation, and/or optimizing processes for all engineering teams
    • Building tools and automation to prevent recurrence of problems across all mission-critical Walmart services
    • Augmenting existing instrumentation to build a cohesive picture of system characteristics across the entire Walmart technology landscape, with special attention to points of failure
  • Architect and implement fault-tolerant systems and services across Walmart's hybrid cloud infrastructure with focus on autonomous recovery and intelligent failure prediction for e-commerce, supply chain, financial services, and in-store technology.
  • Collaborate with engineering teams and leadership across all Walmart technology organizations to establish technical strategies and solutions to improve mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.
  • Work with service owners across all domains (e-commerce, supply chain, stores, fintech, etc.) to define SLOs and build SLIs to ensure all critical systems are meeting SLAs while maintaining optimal performance and user experience.
  • Perform complex troubleshooting and analysis of large-scale distributed systems across Walmart's entire technology stack, using expertise in coding, algorithms, and distributed system design.

Strategic Technical Innovation:
  • Partner closely with all engineering organizations including E-commerce, Supply Chain, Store Technology, Fintech, and Data Platform teams to deliver autonomous reliability solutions through advanced machine learning, natural language processing, and computer vision technologies.
  • Drive the development of MLOps and AIOps platforms that enable continuous learning, model deployment, monitoring, and autonomous optimization of reliability engineering systems across all Walmart domains.
  • Innovate in agentic AI technologies for SRE including large language models (LLMs) for automated incident response, reinforcement learning agents for capacity optimization, multi-modal AI for infrastructure monitoring, and federated learning for cross-domain reliability insights.
  • Implement advanced CI/CD pipelines for reliability systems including automated deployment, validation, and rollback mechanisms for SRE tools and monitoring systems with built-in observability and performance monitoring.
  • Establish platform engineering excellence by building reusable SRE infrastructure, intelligent monitoring platforms, and developer productivity tools that serve all Walmart engineering teams.
  • Provide technical mentorship and guidance to engineering teams across all Walmart organizations on advanced SRE concepts, AI/ML for reliability, platform engineering best practices, and autonomous system design through code reviews, technical discussions, and knowledge sharing.

What you'll bring:
Education & Experience:
  • Bachelor's/Master's degree in Engineering, Computer Science, or related field with 12+ years of hands-on experience in Site Reliability Engineering, AI/ML Engineering, or Platform Engineering.
  • Proven track record as a senior individual contributor in SRE, AI/ML, or Platform Engineering with experience influencing technical decisions and driving technical excellence across teams.
  • Deep experience working with mission-critical systems with KPI expertise in MTTD, MTTR, availability, model performance, and autonomous system reliability.

Must-Have Technical Experience:
  • Expert-level AI/ML engineering experience with deep expertise in machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML system deployment at scale.
  • Advanced experience with agentic AI systems including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
  • Comprehensive Site Reliability Engineering expertise including hands-on experience with Service Management (Incident, Problem & Change Management), Performance and Capacity Engineering for AI/ML systems.
  • Expert-level cloud engineering experience (Azure, GCP, AWS) with deep knowledge of cloud-native AI/ML services, containerization (Kubernetes, Docker), and serverless architectures.
  • Deep observability and monitoring expertise with hands-on experience in:
    • Distributed tracing (Jaeger, Zipkin, OpenTelemetry) for AI/ML pipelines
    • Metrics collection and alerting (Prometheus, Grafana, DataDog) with ML-specific dashboards
    • Log aggregation and analysis (ELK stack, Splunk, Fluentd) for model and system monitoring
    • APM tools and performance monitoring for AI/ML workloads
    • AI-driven anomaly detection and predictive monitoring systems
  • Platform Engineering experience including:
    • Building developer platforms and internal tooling for AI/ML teams
    • Infrastructure as Code (Terraform, CloudFormation, Pulumi)
    • Service mesh architectures (Istio, Linkerd) for AI/ML services
    • API gateway and microservices platform development
    • Self-service ML deployment platforms and developer productivity tools

Industry & Domain Experience:
  • Experience in large-scale retail, e-commerce, or high-traffic consumer-facing systems with strict availability and performance requirements (strongly preferred).
  • Experience with mission-critical distributed systems serving millions of concurrent users across multiple domains (e-commerce, payments, inventory, supply chain, etc.).
  • Experience with enterprise-scale SRE implementations supporting diverse technology stacks and business-critical applications across multiple organizational domains.
  • Experience with complex multi-cloud and hybrid cloud environments supporting diverse workloads with varying reliability and performance requirements.

Technical Leadership & Collaboration Skills:
  • Technical thought leadership and influence in AI/ML architecture decisions, SRE methodologies, and platform engineering strategies across all Walmart technology domains.
  • Strong cross-functional collaboration experience working with diverse engineering teams across E-commerce, Supply Chain, Store Technology, Fintech, Security, and Platform Engineering to deliver enterprise-wide reliability solutions.
  • Excellent technical communication skills with ability to articulate complex SRE and AI/ML concepts to diverse engineering audiences and influence technical decisions across multiple organizations.
  • Mentorship and knowledge sharing experience, providing technical guidance on SRE best practices, AI/ML for reliability, and platform engineering through code reviews, technical discussions, and documentation.
  • High degree of technical ownership and accountability for complex, mission-critical reliability systems with ability to work independently on high-impact projects that span multiple engineering domains.

Preferred Technical Experience:
  • MLOps and model lifecycle management experience with tools like MLflow, Kubeflow, Seldon, or similar platforms for enterprise-scale reliability and monitoring deployments.
  • Natural Language Processing and Computer Vision expertise for building intelligent log analysis, automated incident response, visual infrastructure monitoring, and conversational AI for SRE operations.
  • Edge computing and distributed systems experience for deploying monitoring and reliability solutions across retail stores, distribution centers, and edge infrastructure.
  • Real-time streaming and event-driven architectures using Kafka, Pulsar, or similar technologies for processing high-volume operational data streams across all Walmart systems.
  • Advanced security practices for reliability systems including secure monitoring, data privacy in observability, and secure multi-tenant SRE platforms.
  • Chaos Engineering and fault injection experience across diverse system types including e-commerce, supply chain, financial services, and in-store technology.
  • Performance optimization for large-scale distributed systems including database optimization, network performance tuning, and infrastructure cost optimization.
  • Open source contribution experience in SRE, observability, and infrastructure automation tools and familiarity with industry best practices and emerging technologies.


At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.

You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.

For information about PTO, see https://one.walmart.com/notices.

Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.

For information about benefits and eligibility, see One.Walmart.

The annual salary range for this position is $169,000.00 - $338,000.00

Additional compensation includes ann

About Walmart

WalmartLabs is accelerating its development to redefine the shopping experience and meet the changing needs of customers wherever they are, whether in a store, on the website, or on a mobile device.

Walmart Careers

Joining Walmart means becoming part of a world-renowned team that leads with innovation and is committed to creating impact. As the largest retailer globally, Walmart offers unparalleled job opportunities and career growth in an environment that values diversity and leadership.

Work You’ll Do
At Walmart, you will be part of a dynamic team that drives our mission to help people save money so they can live better. Engage in work that matters with a company that offers both stability and the flexibility to explore different career paths. Transform the retail landscape with your skills and help shape the future of millions of customers worldwide. Walmart is at the forefront of combining retail with advanced technology, making it an exciting place for professional growth and innovation.

Lead with Us
Step into a role that harnesses your potential and places you at the intersection of retail and technology. Walmart is not just a company; it's a community where you can develop your leadership skills and contribute to a culture that nurtures professional growth. Work with a diverse team of experts who bring a wealth of knowledge and experience to the table. Our commitment to diversity training ensures that all team members are valued and can thrive.

Walmart Careers and Employment Opportunities
We are continuously expanding our team to include enthusiastic professionals eager to drive change and make a significant impact. Explore a range of positions from entry-level to executive, each offering competitive benefits and the opportunity to advance.

Innovate with Us
Join Walmart and be part of a team that is dedicated to innovation and excellence. With over 2.3 million associates worldwide, you are joining the largest private employment group, ready to innovate, lead, and impact the global market.

Internship Programs
Kickstart your career with a Walmart internship. Gain invaluable experience, build your resume, and develop networking connections that will empower your career journey. Our internships provide a platform to apply your academic knowledge in real-world scenarios, preparing you for future employment.

Be Part of a Great Team
At Walmart, our team is our strength. We invest in our employees through robust training programs, leadership development, and opportunities for career advancement. Enjoy the benefits of working in a supportive and inclusive environment where every member’s contribution is valued.

Future-Proof Your Career
Your journey at Walmart can be as vast as your ambitions. With endless opportunities to grow, learn, and lead, you can take your professional experience to new heights. Benefit from our comprehensive training programs and develop the skills needed for tomorrow’s challenges.

Stay Connected

Join Our Team
Search open positions that match your skills and interests. We are looking for passionate, curious, creative, and solution-driven team players. Explore the diverse job opportunities at Walmart and find where you can make a difference.

Keep Up to Date
Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today, all from the people who work here.

Job Alert Emails
Personalize your subscription to receive job alerts, latest news, and insider tips tailored to your preferences. Discover the exciting and rewarding opportunities that await at Walmart.

Join Walmart today and be part of a story of growth, innovation, and leadership.

Learn more about Walmart

Size: 2,300 employees
Market Cap: $387 billion
Industry:
Net Income: $13.5 billion
Founded: 1962
5 Year Trend: +3.3%
Revenue: $559.1 billion
NASDAQ
