Oracle Corporation

Principal TPM - AI Infrastructure

$102K - $209K
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in technical program management or related field.
  • Experience leading cross-functional initiatives with measurable outcomes.
  • Strong operational background in governance, incident/change processes, and risk management.
  • Excellent written and verbal communication skills for executive updates and recommendations.
  • Highly organized, able to manage multiple priorities in ambiguous situations.
  • Advanced Excel user with skills in data analysis and modeling.
  • Experience with dashboarding and reporting tools.

Responsibilities

  • Drive availability and reliability of large-scale GPU fleets, leading recovery efforts.
  • Support operational performance of AI workloads across multi-region GPU clusters.
  • Own end-to-end execution of AI Infrastructure GPU operations programs.
  • Run weekly governance forums across multiple initiatives, ensuring clarity in ownership and timelines.
  • Manage deployment governance and establish incident management mechanisms.
  • Build and maintain reports and forecasts for GPU operations programs.
  • Strengthen partnerships with internal and external stakeholders for operational efficiency.

Benefits

  • Medical, dental, and vision insurance, including expert medical opinion.
  • Short and long term disability insurance.
  • Life and AD&D insurance, including supplemental options.
  • Flexible Spending Accounts for healthcare and dependent care.
  • 401(k) plan with company match and paid vacation time.
  • 11 paid holidays and paid sick leave that carries over.
  • Paid parental leave and adoption assistance.

Full Job Description

The AI Infrastructure GPU Operations Team drives deployment planning, execution governance, operational readiness, reliability, and business rhythm for OCI's rapidly expanding GPU infrastructure portfolio. As Principal Technical Program Manager, you will lead cross-functional programs that connect engineering, platform, operations, business, finance, observability, SRE, network, and leadership teams across complex GPU operations initiatives.

You will own operating mechanisms for regional deployment readiness, GPU fleet health, milestone tracking, executive reporting, incident and change governance, risk management, and operational handoff across multiple concurrent GPU operations programs. This role requires strong program discipline, business analytics capability, and the ability to turn ambiguous technical and operational inputs into clear priorities, metrics, decisions, and action plans.

You will also improve the way the organization scales by strengthening dashboards, telemetry, documentation, onboarding, playbooks, repeatable processes, and the practical use of AI to improve operations productivity. The ideal candidate brings crisp communication, strong ownership, and pragmatic simplification to high-visibility GPU operations programs where disciplined execution, customer impact, and measurable reliability outcomes matter.

You are a structured, data-driven program leader who values simplicity, scalability, reliability, and clear operational mechanisms. You thrive in collaborative environments, communicate crisply with senior stakeholders, and drive consistent execution through ownership, metrics, and disciplined follow-through. You combine strategic clarity with enough technical and operational depth to help teams deliver reliable OCI AI Infrastructure GPU Operations while continuously improving the processes, telemetry, and automation that support it.

Travel: as needed for cross-site coordination, stakeholder alignment, and partner engagements.

Key Responsibilities

GPU Fleet Operations & Reliability
  • Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
  • Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
  • Lead GPU fleet health reviews across current and next-generation hardware, including NVIDIA H200, B200, and GB200/GB300 platforms; AMD Instinct MI300X, MI325X, MI350X, and MI355X; and related platforms.

Program Leadership & Execution
  • Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
  • Set and run weekly operating cadences and governance forums across multiple concurrent initiatives, ensuring clear ownership, timelines, dependencies, decision points, and committed actions.
  • Coordinate cross-functional delivery across engineering, platform, operations, business operations, finance, observability, SRE, network, and senior leadership stakeholders.

Incident, Change & Deployment Governance
  • Manage deployment governance, change review, readiness tracking, stakeholder handoff, and operational execution processes.
  • Establish and scale structured incident management mechanisms, improving root cause analysis, corrective and preventive actions, and follow-through on durable fixes.
  • Serve as a primary escalation point between engineering and operations teams, resolving priority conflicts and accelerating issue resolution.
  • Lead Change Review Board processes for high-volume change activity, minimizing change-related incidents and protecting service quality.

Business Planning, Metrics & Executive Reporting
  • Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.
  • Own executive-level reporting, including monthly business reviews, weekly operational KPIs, critical project updates, risks, dependencies, decisions, and mitigation plans.
  • Provide data-driven insights into infrastructure performance, operational risk, customer impact, and measurable program outcomes for senior leadership.

Cross-Functional & Stakeholder Engagement
  • Strengthen partnerships with hardware vendors, cloud platform teams, SRE, cloud engineering, network teams, and other internal stakeholders to improve issue resolution and operational efficiency.
  • Translate complex technical, operational, and business situations into accurate narratives, recommendations, and action plans for senior stakeholders.
  • Drive structured escalation and bug reporting mechanisms that reduce time-to-resolution for critical issues.

Operational Excellence, Optimization & AI Productivity
  • Create and maintain documentation, playbooks, onboarding materials, runbooks, and repeatable processes that reduce ambiguity and improve execution quality.
  • Drive practical use of AI and automation to improve operations productivity, reduce manual toil, accelerate triage, improve ticket prioritization, and strengthen repeatability across GPU operations workflows.
  • Partner with observability and telemetry teams to improve infrastructure visibility, including RDMA telemetry, network fabric health, service health metrics, and operational dashboarding.
  • Lead continuous improvement efforts such as validation frameworks, version set validation, link flap analysis, and long-tail performance optimization.
  • Monitor and improve operational health across technologies such as RoCE, InfiniBand, and large-scale data center networks.

Qualifications / Experience
  • 5+ years of experience in technical program management, program operations, business operations, data analysis, infrastructure operations, or a related discipline.
  • Demonstrated ability to lead complex, cross-functional initiatives with measurable outcomes across technical, operations, business, and customer-facing stakeholders.
  • Strong operational background with experience building cadences, governance mechanisms, KPI reporting, incident/change processes, risk management processes, or readiness programs.
  • Strong written and verbal communication skills; comfortable synthesizing complex technical and operational information into executive updates, recommendations, and decisions.
  • A high degree of organization and ability to manage multiple competing priorities independently through ambiguity.
  • Experience identifying, measuring, and adjusting execution plans against key business, operational, reliability, or delivery metrics.
  • Advanced Excel skills, including pivots, lookups, conditional logic, data modeling, and financial or operational analysis.
  • Experience developing dashboards, automated reporting, or analytical tools that provide reliable business and operational visibility.
  • Working knowledge of PowerPoint, Jira, Confluence, and related collaboration or delivery management tools.

Preferred / Nice to Have
  • Experience with cloud infrastructure, AI/ML infrastructure, GPU operations, data center deployment, capacity planning, or large-scale platform operations.
  • Experience supporting large GPU fleets, distributed AI training or inference workloads, or performance-sensitive infrastructure environments.
  • Experience with incident management, root cause analysis, corrective and preventive action tracking, Change Review Board processes, or high-volume change governance.
  • Familiarity with observability, telemetry, RDMA, RoCE, InfiniBand, network fabric health, service health metrics, ticket/incident analytics, or operational dashboarding.
  • Finance, business planning, workforce planning, or operational readiness experience in a technology organization.
  • Track record of influencing senior business and technology leaders without relying on direct authority.


Compensation and Benefits

US: Hiring range in USD: $102,300 to $209,500 per annum. May be eligible for bonus and equity.

Oracle maintains broad salary ranges for its roles to account for variations in knowledge, skills, experience, market conditions, and locations, and to reflect Oracle's differing products, industries, and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:
1. Medical, dental, and vision insurance, including expert medical opinion
2. Short term disability and long term disability
3. Life insurance and AD&D
4. Supplemental life insurance (Employee/Spouse/Child)
5. Health care and dependent care Flexible Spending Accounts
6. Pre-tax commuter and parking benefits
7. 401(k) Savings and Investment Plan with company match
8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
9. 11 paid holidays
10. Paid sick leave: 72 hours of paid sick leave upon date of hire, refreshed each calendar year. Unused balance carries over each year up to a maximum cap of 112 hours.
11. Paid parental leave
12. Adoption assistance
13. Employee Stock Purchase Plan
14. Financial planning and group legal
15. Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.
Career Level - IC4

About Oracle Corporation

Oracle Dyn Global Business Unit is a pioneer in managed DNS and a leader in cloud-based infrastructure that connects users with digital content and experiences across a global internet. Dyn's solution is powered by a global network that drives 40 billion traffic optimization decisions daily for more than 3,500 enterprise customers, including preeminent digital brands such as Netflix, Twitter, LinkedIn, and CNBC. Adding Dyn's best-in-class DNS and email services extends the Oracle cloud computing platform and provides enterprise customers with a one-stop shop for Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS). On January 31, 2017, Oracle completed the acquisition of Dyn, which now operates as an Oracle Infrastructure-as-a-Service (IaaS) global business unit (GBU).

Oracle Corporation Careers

Join Oracle Corporation, a global leader in technology and innovation, and be part of a team that values professional growth, leadership, and diversity. At Oracle, we offer unparalleled job opportunities in the tech industry, fostering a culture of innovation and continuous improvement.

Work You’ll Do

At Oracle, your work will directly impact the future of technology across industries. As part of our team, you will lead projects that redefine the way businesses operate, leveraging Oracle’s cutting-edge technology solutions. Our commitment to leadership in the tech community means you’ll be working at the forefront of innovation, enhancing your skills through hands-on experience and comprehensive diversity training.

Join Our Dynamic Team

Oracle is not just a technology company; we are a team of dedicated professionals committed to creating a supportive and inclusive environment. Here, every team member’s contribution is valued, and diversity is celebrated. With Oracle, you are not just accepting a job; you are joining a community that promotes personal and professional growth through constant learning and development opportunities.

Innovative Work and Career Advancement

Embrace the chance to do innovative work with Oracle Corporation, where we push the boundaries of what is possible. With over 130,000 dedicated professionals globally, Oracle offers a workplace where innovation and thought leadership thrive. This environment is perfect for those who are driven to explore new ideas and are eager for opportunities to advance their careers.

Explore Job Opportunities and Internships

Whether you’re a seasoned professional looking for your next career challenge or a student seeking a promising internship, Oracle provides a range of opportunities. Explore positions that match your skills and interests in areas such as cloud computing, enterprise software, and business analytics. Our hiring process is designed to find not just the right skills but also the right fit for Oracle’s unique culture.

Benefits and Culture

Oracle is committed to supporting our employees’ life and work ambitions. We offer competitive benefits, including health insurance, retirement plans, and wellness programs, all designed to support your career and well-being. Our culture of empowerment encourages networking and collaboration across teams and geographies, ensuring that innovation and creativity flourish.

Develop Your Skills Through Training and Networking

Prepare for your future with Oracle’s comprehensive training programs. From leadership development to technical skills enhancement, we provide the tools necessary to succeed in your career and stay ahead in the industry. Networking within Oracle’s global community will also open doors to collaborative opportunities and career advancement.

Stay Connected with Oracle Careers

Keep up to date with the latest from Oracle Corporation by following our careers blog. Gain insights from the experts and learn about new job openings as they become available. Personalize your job search and stay informed about Oracle’s career events and professional development opportunities.

Join Oracle Corporation—Where Careers Grow

At Oracle, we believe in nurturing the potential of our employees. The growth of our company is driven by the individual successes of our team members. We invite you to bring your unique talents to Oracle, join our mission to drive technological innovation, and help shape the future of the digital world.

Search Oracle Jobs

Ready to take the next step in your career? Search for open positions that align with your skills and passions. We are continuously looking for curious, creative, and motivated individuals to join our team. Explore the opportunities and find out how you can contribute to the success of Oracle Corporation.

Oracle Corporation: Leadership, Innovation, Opportunity.

Learn more about Oracle Corporation
Size: 143,000 employees
Market Cap: $217.3 billion
Net Income: $12.8 billion
Founded: 1977
5 Year Trend: +2.3%
Revenue: $39.6 billion
