About the Team
At eBay, we connect millions of people with the things they need and love, anytime, anywhere. The Americas organization plays a critical role in delivering a seamless, personalized shopping and selling experience across the U.S., Canada, and Latin America.
We operate at massive scale, where reliability, speed, and trust directly impact customer experience and business outcomes. Our team spans marketing, merchandising, operations, and platform engineering, working together to build, accelerate, and redefine commerce.
Role Overview
We are looking for a Problem Manager within the Site Engineering and ITSS organization who will own and drive end-to-end Problem Management outcomes across the Americas platform.
This role goes beyond process coordination: you will lead the identification, prioritization, and elimination of systemic issues impacting site reliability and customer experience. You will partner closely with Incident Management, Change Management, and Engineering teams to reduce repeat incidents, improve platform stability, and drive accountability for long-term fixes.
You will play a key role in transforming Problem Management into a data-driven, proactive discipline that prevents incidents before they occur.
Key Responsibilities
Own Problem Management Outcomes
- Own the end-to-end lifecycle of problems, from detection through root cause analysis to permanent resolution
- Drive reduction of repeat and high-impact incidents across the platform
- Ensure timely progress, visibility, and closure of problems aligned with defined SLAs
Drive Proactive Detection & Prevention
- Leverage incident trends, telemetry, and observability data to identify systemic risks and emerging patterns
- Partner with engineering teams to prevent incidents before they occur, not just react to them
- Establish and evolve early warning indicators for platform health
Lead High-Quality Root Cause Analysis
- Facilitate blameless post-incident reviews that produce actionable, measurable outcomes
- Ensure root causes go beyond symptoms to identify systemic, architectural, or process gaps
Drive Accountability Across Teams
- Partner with engineering, product, and operations teams to prioritize and deliver corrective actions
- Hold stakeholders accountable for commitments, timelines, and outcomes
- Escalate risks and blockers effectively to ensure resolution
Align with Business Impact
- Prioritize problems based on customer experience and business impact (e.g., checkout failures, listing disruptions, revenue impact)
- Communicate clearly with stakeholders at all levels, including executive audiences when needed
Strengthen Cross-Functional Operations
- Collaborate closely with Incident and Change Management to ensure seamless lifecycle integration
- Participate in daily operational cadences and drive alignment across teams
- Build strong working relationships across global and regional organizations
Define and Track Success Metrics
- Establish and monitor KPIs such as:
  - Reduction in repeat incidents
  - Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) improvements
  - Problem backlog health and SLA adherence
- Use metrics to continuously improve Problem Management maturity
Required Qualifications
- 3+ years of experience in Problem Management, Incident Management, or Site Reliability/Operations roles
- Proven experience managing major incidents and driving post-incident resolution
- Strong analytical and problem-solving skills with the ability to connect technical issues to business impact
- Excellent communication skills with the ability to influence across engineering and business teams
- Demonstrated ability to drive accountability without direct authority
- Experience working in fast-paced, large-scale distributed systems environments
- High emotional intelligence and ability to navigate complex, cross-functional relationships
Preferred Qualifications
- Experience with Change Management processes and release coordination
- ITIL v3 or v4 certification (advanced preferred)
- Familiarity with observability tools, incident analytics, or SRE practices
- Experience in e-commerce or high-traffic consumer platforms
What Success Looks Like
- Measurable reduction in repeat and high-severity incidents
- Faster recovery times and improved platform reliability
- Strong adoption of blameless RCA practices with actionable outcomes
- Increased engineering ownership of systemic fixes
- A shift from reactive incident response to proactive problem prevention
Additional Details
The base pay for this position is expected to fall within the range below:
$70,000 - $125,000
Base pay offered may vary depending on multiple individualized factors, including location, skills, and experience. The total compensation package for this position may also include other elements, including a target bonus and restricted stock units (as applicable) in addition to a full range of medical, financial, and/or other benefits (including 401(k) eligibility and various paid time off benefits, such as PTO and parental leave). Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
If hired, employees will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
Remote roles are not eligible for U.S. visa sponsorship.