Senior Site Reliability Automation Engineer

IBM

Littleton, MA

Industry: Hardware


8 - 10 years

Posted 391 days ago

This job is no longer available.

Job Description

We are looking for a dynamic Senior Site Reliability Automation Engineer (Sr. SRAE) in Littleton, MA to join our Cloud Innovation Lab (CIL) team, which responds to market needs to deliver value to our clients in a fast-changing cloud landscape. The CIL team is dedicated to ensuring that the IBM Cloud is at the forefront of cloud technology, from data center design to network architecture to storage and compute clusters to flexible infrastructure services. We are building IBM's next-generation cloud platform to deliver performance and predictability for our customers' most demanding workloads, at global scale and with leading efficiency, resiliency, and security. It is an exciting time, and as a team we are driven by this incredible opportunity to thrill our clients.
In this Sr. SRAE role, you will work closely with the Data Center, the entire Cloud Innovation Lab development organization, and IBM vendors to support, maintain, and operationally improve the cloud infrastructure. You will focus on the following key responsibilities:

  • Automate health monitoring of production and test systems
  • Automate return to service procedures for Cloud Platform Components
  • Integrate automation with operational requirements
  • Work with Engineering to:
    • Define operational requirements
    • Automate operational requirements
    • Participate in the full deployment pipeline
  • Work with Partners to:
    • Identify and resolve issues
    • Discuss and plan integration requirements
    • Integrate new components (object store, GPU, etc.) into the Platform

    Required Technical and Professional Expertise

    • A minimum of 4 years’ experience with Python, Bash, or another scripting language
    • Solid understanding of how DNS works
    • A minimum of 8 years’ experience within a public cloud offering (e.g., AWS, SoftLayer)
    • A minimum of 4 years’ experience with configuration management systems (e.g., Ansible, Chef)
    • A minimum of 4 years’ experience using Splunk and/or ELK
    • Must be extremely comfortable using and navigating within a Linux environment
    • Experience with standard version control systems (GitHub, GitLab, etc.)

    Preferred Technical and Professional Experience

    • A minimum of 10 years’ experience in hands-on production administration of large system environments
    • Understanding of the TCP stack
    • Experience with routing/switching protocols