What you'll do...
Position: Senior Site Reliability Engineer
Job Location: 14901 Quorum Drive, Dallas, TX 75254
Duties: Assist in creating a simple, modular, extensible, and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a product based on business requirements. Convert high-level design (HLD) into detailed design using mock screens, pseudocode, and detailed functional logic for specific modules and components of a product/system. Understand nuances of designing for disaster recovery. Design and create a minimum viable product (MVP) to clarify requirements and design and to uncover risks. Refine the MVP design for early defects and revised customer requirements. Undertake infrastructure coding automation. Adhere to all relevant coding guidelines while writing/configuring code. Create/configure minimalistic (less complex, highly robust, and high quality) code for a component/module under guidance. Maintain records by documenting program development and revisions. Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery. Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated. Implement telemetry features as required under guidance. Apply security policy requirements to the component/module during code development/configuration. Detect and document defects, bugs, and errors for the assigned component/module and conduct analysis to determine the sources under guidance. Troubleshoot performance and availability bottlenecks for the assigned application under guidance. Work with business partners to identify and document critical applications. Interpret and follow procedures in contingency plans. Explain the contingency and disaster recovery plans for the assigned environment. Execute established procedures necessary to continue operations in an emergency. Participate in the design of a minimum operating environment for a computer-based facility.
Utilize established criteria (for example, probability of failure, frequency of failure) to measure site reliability. Monitor site reliability conditions and new reliability requirements. Assist in the design and development of a reliability program plan for a specific site environment. Apply appropriate tools, services, or applications for reliability prediction and other site improvements. Research and assess various reliability models for different site environments. Suggest metrics to monitor software or system performance. Monitor current performance data to ensure compliance with defined service level objectives (SLOs) for multiple applications/systems. Determine thresholds for monitoring metrics and trigger alerts based on those thresholds. Help with specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Make recommendations regarding situational awareness and alerting. Make recommendations regarding instrumentation gaps and alerting logic, including a variety of operating systems, hardware, and software.
Minimum education and experience required: Master's degree or the equivalent in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, Electrical Engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor's degree or the equivalent in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, Electrical Engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Skills required: Experience with the management and orchestration of Kubernetes clusters with Helm charts. Experience with networking solutions including VPN systems, firewall technologies, and storage systems. Experience building scalable monitoring and observability systems using CloudWatch, PRTG, Grafana, and PagerDuty. Experience with server management in AWS with orchestration tools, including Ansible, Puppet, and Terraform. Experience managing DNS and SSL certificates in AWS. Experience managing enterprise workloads in an AWS infrastructure. Experience building CI/CD pipelines using GitHub Actions, CodeBuild, CodePipeline, and CircleCI. Experience managing RDBMS including PostgreSQL and MSSQL Server and non-RDBMS including Redshift and MongoDB. Experience writing unit and integration tests. Experience with tool development, including scripting with Bash and high-level languages: Python and TypeScript. Employer will accept any amount of experience with the required skills.
Salary Range: $112,923/year to $180,000/year. Additional compensation includes annual or quarterly performance incentives.
Benefits: At Walmart, we offer competitive pay as well as performance-based incentive awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401(k), stock purchase, and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms. For information about benefits and eligibility, see One.Walmart.com.