Site Reliability Engineer 2 in Redmond, WA

Industry: Enterprise Technology

Less than 5 years

Posted 7 weeks ago

Core Services Engineering builds and manages the critical products and services that Microsoft runs on. We boldly pursue big ideas that power transformational advances at Microsoft and for our customers, while helping Microsoft teams work smarter, faster and more securely every day. Core Services Engineering employees have deep technical and business expertise, customer insights, and a clear point of view that comes from first-hand, large-scale experience with Microsoft and industry solutions. We are engineers, technology leaders and experts, digital transformation change agents, and customer advocates. We have exciting opportunities for you to innovate, influence, transform, inspire and grow within our organization and we encourage you to apply to learn more!

Microsoft has been a leading company in computing for decades. We are a global company, relied on by companies, governments, utilities, stores, schools, universities and co-operatives to deliver the things they need to work, every day. To make this work, we need to make it reliable. To make it reliable, we need you: someone who already is, or is interested in becoming, a Site Reliability Engineer (SRE) within our Secure Admin Services (SAS) Site Reliability Engineering team.

The Site Reliability Engineering (SRE) team provides leadership, direction and accountability for application architecture, system design, and end-to-end implementation. As a Site Reliability Engineer, you will identify and deliver service improvements using your expertise in services engineering, systems, networks and software, along with reliability and dependency analysis and scalable system design principles. Strong collaboration skills are required to work closely with other engineering teams, service owners and support teams to ensure services and systems are highly stable and performant, meeting the expectations of our user base across the company.

Site Reliability Engineering is a hybrid role, comparatively rare in industry but crucially important to how things work behind the scenes today. SREs are people who take engineering-based approaches to solving operations problems; we like infrastructure, we like seeing how the big complicated thing works, and most importantly, we gain great satisfaction from making it better.

Our Site Reliability Engineers are persistent problem solvers, always focused on mitigating issues and owning a problem until a resolution is in place. To accomplish this, they work in close collaboration with various engineering teams. They are also involved in automation, developing tools to support the DevOps model, and analyzing vast amounts of data to find trends and suggest improvements. Creativity and data-driven decision making are heavily valued in this emerging role.

Site Reliability Engineers build, monitor, and maintain the systems and infrastructure that ensure our customers can quickly access their data and run workloads whenever they need to. We identify service problems and areas for improvement, and we help implement solutions. Our work is key to the security of many Microsoft services and to Microsoft's credibility. Secure Admin Services provides secure access to Microsoft's entire infrastructure and ecosystem.

Responsibilities

  • Provide technical engineering for a cross-functional, highly visible, operations team supporting the secure access services platform for Microsoft's corporate network.
  • Identify opportunities and drive the implementation of automation to improve service health, manageability, reliability and telemetry.
  • Own, triage, investigate and resolve service issues with an emphasis on broad communications, learning & teaching throughout the process.
  • Read, write, configure, design, and script end-to-end service telemetry, alerting and self-healing capabilities for platforms.
  • Author functional and technical documentation.
  • Communicate on a deeply technical level with product engineering, project management and operations teams to improve and optimize products, improve infrastructure, and evolve services.
  • Remain current on new technologies, methods and procedures including, but not limited to, coding practices such as Test Driven Development, Continuous Integration, and Continuous Deployment.

Qualifications

Required Qualifications:

  • BA/BS in Computer Science, Computer Engineering or related technical discipline, or equivalent work experience.
  • 3+ years of experience with the Microsoft Windows server architecture and/or Microsoft stack including O365, Azure, Windows or other Microsoft software/services.
  • 3+ years of experience with full-stack troubleshooting across networks, applications, hardware, management fabric or distributed services layers.

Preferred Qualifications:

  • 5+ years of scripting and programming experience (preferably .NET, PowerShell, Python, C#).
  • Familiarity with one or more general purpose programming languages including but not limited to: Java, C/C++, C#, Python, JavaScript, PowerShell.
  • Experience leveraging cloud architecture, applying site reliability principles, and/or demonstrating sensitivity to operational concerns.
  • Demonstrated ability to debug, fix, and optimize code.
  • Excellent troubleshooting skills are a must to be successful in this role.
  • Out-of-the-box, quick and agile thinking to adapt to a fast-paced, changing environment.
  • Deep knowledge of system design & architecture, and of running complex, large-scale online services.
  • Demonstrated technical experience with site reliability engineering or software development and operations.
  • Experience building distributed cloud-based software services.
  • Fast learner, introspective.
  • Ability to contribute to multiple projects/demands simultaneously.