Full Job Description
As part of Retail Engineering Store Operations & Support, you will play a crucial role in detecting and resolving issues that impact our global retail environment. This role sits at the intersection of production support, SRE, and applied AI engineering. This is a team scaling fast, going deep technically, and investing heavily in the next generation of automation, agentic operations orchestration, and operational intelligence.
This is a hands-on engineering role with high visibility across Apple's global retail technology landscape and an opportunity to shape how production support operates at scale.
As a Senior Operations Engineer on the transactional operations vertical, you will own deep technical expertise across some of Apple Retail's most critical systems, including Point of Sale, Apple Financial Services, Carrier Services, Runner, and Catalog. You will combine that domain expertise with hands-on troubleshooting and with building and extending the GenAI agents, screening tools, and automation that fundamentally change how our team detects, investigates, and resolves issues.
You will partner closely with Engineering, SRE, and business teams to drive root cause analysis, deliver process improvements, and bring clarity to complex technical problems for both technical and non-technical stakeholders.
The ideal candidate is a proactive problem solver who thrives in fast-moving production environments where speed, scale, and precision all matter, who codes when needed, and who communicates with precision across audiences. If you are passionate about supporting reliable, high-impact systems that serve millions of customers worldwide, this may be the perfect opportunity for you.
Bachelor's degree or higher in Computer Science, Information Technology, or a related field, or equivalent work experience.
5+ years of experience supporting critical, customer-facing systems in a high-volume production environment.
3+ years of hands-on experience with incident management platforms (e.g. ServiceNow) and issue tracking tools (e.g. Jira).
3+ years of practical experience with Splunk, including dashboard creation, SPL querying, and alert configuration for production triage, performance degradation analysis, and incident resolution.
3+ years of experience performing structured root cause analysis using application logs, telemetry, distributed traces, and customer feedback across complex, multi-system environments.
Hands-on experience building and orchestrating agentic workflows with SOTA language models, LLM-based automation, or AI-augmented operational tooling.
Demonstrated ability to work across distributed systems, APIs, and microservices at an architectural level - understanding how failures propagate across system boundaries.
Strong understanding of networking fundamentals (TCP/IP, DNS, HTTP/TLS, load balancing) with the ability to diagnose connectivity issues between distributed systems.
Strong root cause analysis skills using diverse data sources including application logs, telemetry, and customer feedback.
Experience with scripting languages (Python preferred) for log analysis, data investigation, and lightweight automation of operational workflows.
Excellent ability to communicate complex technical issues clearly and concisely to both technical and non-technical stakeholders.
Experience coordinating with distributed teams across multiple time zones.
Familiarity with CI/CD pipelines, Git-based version control, and release management/deployment processes.
Experience with data visualization and analysis tools such as Tableau or Power BI.
Proficiency in ITIL practices, including incident, problem, and change management.
Experience validating data feeds between SAP (MM, SD, FI) and downstream retail systems, with the ability to identify and troubleshoot discrepancies in inventory, pricing, and order data.
Previous experience supporting eCommerce platforms or Retail / Payment systems at scale is a plus.