Site Reliability Engineer

American Express

Phoenix, AZ

Industry: Finance & Insurance


8 - 10 years

Posted 47 days ago

This job is no longer available.

Why American Express?

There's a difference between having a job and making a difference.

American Express has been making a difference in people's lives for over 160 years, backing them in moments big and small, granting access, tools, and resources to take on their biggest challenges and reap the greatest rewards.

We've also made a difference in the lives of our people, providing a culture of learning and collaboration, and helping them with what they need to succeed and thrive. We have their backs as they grow their skills, conquer new challenges, or even take time to spend with their family or community. And when they're ready to take on a new career path, we're right there with them, giving them the guidance and momentum into the best future they envision.

Because we believe that the best way to back our customers is to back our people.

The powerful backing of American Express.

Don't make a difference without it.

Don't live life without it.

Are you someone who says, "Why not?" rather than "Why?" Are you someone who lays down new paths? Do you love to dream boldly, explore, and discover new experiences?

The success of our entire company rests on our systems, networks, and people. Ours is a team of highly skilled DevOps, ProdOps and SRE engineers who strongly advocate automation and monitoring across all the applications and platforms we support. In the last few years, with innovation at its core and a never-say-die attitude, this team has been shaping the digital future at AmEx while becoming the poster child for defining the art of the possible!

As a Site Reliability Engineer (SRE), you will be responsible for a broad range of activities. You will work closely with application development teams to build standards that drive the highest levels of availability across our critical Servicing, Messaging and Marketing portfolios. You will join a team that provides 24/7 support, and you will be expected to develop solutions that improve production support and monitoring services while responding to incidents to keep applications highly available. You will also drive engineering work such as automating infrastructure, designing and building tools, and writing code to support our application teams.

In this role, you will be responsible for (but not limited to) the following:

  • Lead a team of DevOps, ProdOps and SRE engineers in supporting critical Servicing, Messaging and Marketing applications.
  • Work closely with our application engineering teams to launch and maintain applications both on-premises and in hybrid-cloud environments.
  • Act as primary escalation point for our L1 support team in helping to make decisions to restore service and minimize impact to availability.
  • Provide production support and respond to production incidents as the first line of defense for the organization.
  • Diagnose intricate software problems, provide solutions and workarounds to ensure the highest level of reliability and availability for critical applications.
  • Facilitate the resolution of non-application issues (third-party upstream issues, infrastructure issues, storage, database, network, file transfer, etc.).
  • Debug network and performance issues in large-scale distributed systems.
  • Provide consultation and strategic recommendations by quickly assessing and remediating complex availability issues.
  • Participate in and oversee upgrades or migrations of platforms and applications to production, and other planned maintenance activities.
  • Drive monitoring requirements to ensure business-service-level visibility for all support teams.
  • Introduce new and impactful technologies to the production support tool chain to minimize friction for production releases and enable quick diagnosis of, and recovery from, production incidents.
  • Challenge the status quo, identify opportunities to adopt innovative technologies that enable business capabilities, and generate creative ideas and solutions to difficult problems.
  • Have an "Automation First" mindset so that repetitive tasks are not handled manually.
  • Be highly influential at all levels, including peers, leaders and key stakeholders. Distill complex ideas and concepts into clear, structured, easy-to-understand language.


  • 8+ years of software development experience, including experience in a DevOps environment.
  • Experience with Java/J2EE/UI applications.
  • BS degree in Computer Science, Computer Engineering, or another technical discipline, or equivalent work experience.
  • Experience supporting a 24/7 enterprise environment with on-call responsibilities for production support.
  • Broad technical field exposure, with preference for the following skills: cloud infrastructure, VMs, load balancing, containers, JVMs, web servers, application debugging, queuing technologies, caching technologies, databases, routing and switching, etc.
  • Knowledge of Linux internals and experience managing Linux systems in high traffic environments.
  • Experience managing relational and NoSQL databases such as Oracle and Couchbase.
  • Hands-on experience leveraging enterprise tools such as Splunk, Grafana, Dynatrace, and AppDynamics.
  • Strong interpersonal communication skills and the ability to work well in a diverse, team-focused environment.
  • Experience with Google Cloud, Python, Hive, or Hadoop is a plus.

Eligibility to work for American Express in the U.S. is required, as the company will not pursue visa sponsorship for these positions.