Senior Site Reliability Engineer

Walmart, Inc.

$112K - $180K
US-Anywhere
Remote in Rockville Centre, NY
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Master's degree in a relevant field plus 1 year of experience, or Bachelor's degree plus 3 years of experience, in site reliability engineering or related areas.
  • Experience with Kubernetes cluster management and orchestration using Helm charts.
  • Proficient in networking solutions such as VPNs and firewalls, and in storage systems.
  • Skilled in building monitoring systems using CloudWatch, Grafana, and similar tools.
  • Experience with AWS for server management and orchestration using tools like Ansible and Terraform.
  • Familiar with CI/CD pipeline development using GitHub Actions and other tools.
  • Proficient in writing unit and integration tests, as well as scripting with Bash, Python, and TypeScript.

Responsibilities

  • Create modular and functional design solutions for product requirements.
  • Evaluate design trade-offs across multiple components based on business needs.
  • Develop detailed designs from high-level designs using mock screens and pseudocode.
  • Design MVPs to clarify requirements and identify risks.
  • Automate infrastructure coding and streamline CI/CD processes.
  • Document development processes, program revisions, and performance data.
  • Monitor site reliability metrics to ensure compliance with established SLOs.

Benefits

  • Competitive salary and performance-based incentives.
  • Comprehensive health benefits including medical, vision, and dental coverage.
  • 401(k) plan with company matching and stock purchase options.
  • Generous paid time off including sick leave and bereavement leave.
  • Education assistance with 100% company-paid college degrees.
  • Additional benefits like adoption reimbursement and military service pay.

Full Job Description

What you'll do...

Position: Senior Site Reliability Engineer

Job Location: 14901 Quorum Drive, Dallas, TX 75254

Duties: Assist in creating simple, modular, extensible, and functional designs for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a product based on business requirements. Convert high-level designs (HLD) into detailed designs using mock screens, pseudocode, and detailed functional logic for specific modules and components of a product/system. Understand nuances of designing for disaster recovery. Design and create MVPs to clarify requirements and design, and to uncover risks. Refine the MVP design for early defects and revised customer requirements.

Undertake infrastructure coding automation. Adhere to all relevant coding guidelines while writing/configuring code. Create/configure minimalistic (less complex, highly robust, and high quality) code for a component/module under guidance. Maintain records by documenting program development and revisions. Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery. Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated. Implement telemetry features as required under guidance. Apply security policy requirements to component/module during code development/configuration.

Detect and document defects, bugs, and errors for assigned component/module and conduct analysis to determine the sources under guidance. Troubleshoot performance and availability bottlenecks for assigned application under guidance.

Work with business partners to identify and document critical applications. Interpret and follow procedures in contingency plans. Explain the contingency and disaster recovery plans for assigned environment. Execute established procedures necessary to continue operations in an emergency. Participate in the design of a minimum operating environment for a computer-based facility.

Utilize established criteria (for example, probability of failure, frequency of failure) to measure site reliability. Monitor site reliability conditions and new reliability requirements. Assist in the design and development of a reliability program plan for a specific site environment. Apply appropriate tools, services, or applications for reliability prediction and other site improvements. Research and assess various reliability models for different site environments. Suggest metrics to monitor software or system performance. Monitor current performance data to ensure compliance with defined SLOs for multiple applications/systems. Determine thresholds for monitoring metrics and trigger alerts based on thresholds. Help with specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Make recommendations regarding situational awareness and alerting. Make recommendations regarding instrumentation gaps and alerting logic, including a variety of operating systems, hardware, and software.

Minimum education and experience required: Master's degree or the equivalent in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, Electrical Engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor's degree or the equivalent in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, Electrical Engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.

Skills required: Experience with the management and orchestration of Kubernetes clusters with Helm charts. Experience with networking solutions including VPN systems, firewall technologies, and storage systems. Experience building scalable monitoring and observability systems using CloudWatch, PRTG, Grafana, and PagerDuty. Experience with server management in AWS with orchestration tools, including Ansible, Puppet, and Terraform. Experience managing DNS and SSL certificates in AWS. Experience managing enterprise workloads on AWS infrastructure. Experience building CI/CD pipelines using GitHub Actions, CodeBuild, CodePipeline, and CircleCI. Experience managing RDBMS including PostgreSQL and MSSQL Server and non-RDBMS including Redshift and MongoDB. Experience writing unit and integration tests. Experience with tool development, including scripting with Bash and high-level languages (Python and TypeScript). Employer will accept any amount of experience with the required skills.

Salary Range: $112,923/year to $180,000/year. Additional compensation includes annual or quarterly performance incentives.

Benefits: At Walmart, we offer competitive pay as well as performance-based incentive awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401(k), stock purchase, and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms. For information about benefits and eligibility, see One.Walmart.com.
