In this role, you will maintain detailed documentation of problem resolution activities and identify and resolve complex product/service problems by phone, through remote connectivity, and on-site.
The Site Reliability Engineer will deploy releases of new technologies and will design, install, configure, maintain, and perform system integration testing of PC/server operating systems, related utilities, and hardware.
As a Site Reliability Engineer, you will be part of a team that works closely with our application development team to provide highly available services to users who consume Splunk as a service on cloud infrastructure. You will provide input on and execute security and deployment practices, scaling, and metrics, and you will handle general day-to-day server management.
In this role, you will be responsible for the management and automation of production environments and applications, including availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
This position requires senior-level experience and capability in failure analysis and corrective action, with accountability for measured improvement in product yield, quality, and/or reliability.
The candidate will drive infrastructure technology plans, specifications and implementations that result in a highly available server, workstation, storage and network infrastructure based on simplicity, resiliency and manageability.
In this role, the selected candidate will promote effective technical collaboration among DoD stakeholders and support the appropriate posture for use of R&M guidance and techniques within the systems acquisition lifecycle.
The candidate will promote the efficient application of R&M best practices with DoD stakeholders. The candidate will also improve systems engineering and development planning policy, guidance, and tools to support the use of principles and best practices by the DoD acquisition community.