Sr. Manager of Infrastructure Operations & Site Reliability Engineering

Smarsh

Portland, OR

Less than 5 years

Posted 269 days ago


Smarsh is seeking an individual with a passion for quality, a strong technical foundation and excellent organizational skills to manage our 24x7 production operations and bring excellence, transparency and predictability to them. If you understand the value that shifting traditional production operations to an engineering focus can bring, know the people, processes and tools that shift requires, and have a passion for innovation, then we want you at Smarsh! You must be a self-starter with a proven track record of building teams, processes and tools in a secure and scalable manner. You must also have experience managing technical people, processes and tools, including infrastructure management, application triage and resolution, monitoring, KPIs, application updates, operational updates, security improvements, hardware upgrades, API changes, client-side updates, DevOps configuration changes and UX/UI changes. You will work collaboratively with all participants in software development projects and support developers, QA, DevOps and infrastructure. This role’s responsibilities span on-premises and cloud-based infrastructure, applications and operations.


Responsibilities:

  • Manage a team of Site Reliability Engineers focused on providing first-level response to production issues and driving their resolution
  • Manage cloud infrastructure and vendors, including capacity costing, growth projections, monitoring, and feedback loops to improve cloud performance
  • Provide proactive cloud and on-premise infrastructure monitoring
  • Continuously research cloud-based infrastructure evolution and provide feedback loops for improvements, changes, or re-architecture
  • Ensure automation and elasticity function properly and infrastructure is used appropriately
  • Manage support escalation backlog and prioritization in partnership with support
  • Establish best practices and create runbooks for the SRE team to adopt
  • Create and implement a strategy to ensure we meet uptime and performance SLAs
  • Work with DevOps and Engineering teams to ensure the stability and reliability of our platform
  • Use the right monitoring tools to triage network, server, database, and application issues
  • Maintain operational excellence and ensure consistency and security throughout
  • Continually innovate and reduce technical debt
  • As required, perform troubleshooting and root cause analysis on release issues
  • Maintain a release repository and manage key information such as build and release procedures, dependencies, and notification lists
  • Manage risks and resolve issues that affect scope, schedule, quality and operational readiness
  • Manage relationships and coordinate work between different teams at different locations
  • Establish meaningful and actionable KPIs
  • Provide feedback loops into development backlog based on operational measures


Requirements:

  • 3+ years leading technical practices and teams for a SaaS provider
  • Strong technical understanding of enterprise application technology architecture, components, databases (SQL & NoSQL), networking, infrastructure as code, etc.
  • 3+ years’ experience working in an Agile software development environment and a strong understanding of SDLC best practices
  • Experience leading 24x7, on-call organizations
  • Experience with cloud providers (AWS, Azure, etc.)
  • Strong understanding of project management principles desired
  • Knowledge of custom software and third-party integration programs
  • Experience transitioning software from the development stage to support and production readiness
  • Strong experience with day-to-day production realities: availability, root cause analysis and correlation analysis
  • Knowledge of configuration and release management processes
  • Ability to propose alternative approaches to improve efficiency and accuracy
  • Demonstrated experience working in a matrix organization with extremely tight time frames
  • Knowledge of ITIL processes
  • Strong experience in managing and driving releases using Agile principles
  • Demonstrated ability to coordinate cross-functional work teams toward task completion
  • Demonstrated effective leadership and analytical skills
  • Smart, driven problem solver
  • Highly motivated self-starter with the ability to prioritize
  • Excellent communication skills and the ability to collaborate well with peers and business users
  • Ability to perform well under pressure in a fast-paced environment with a high sense of urgency
  • Deep care for delivering quality software

The ideal candidate:

  • Experience with complex, highly integrated legacy systems
  • Experience with an on-prem to cloud migration
  • Hands-on experience with Windows and Linux systems
  • Comfortable communicating in writing and in person with technical and non-technical stakeholders
  • Experience implementing monitoring and alerting solutions
  • Ability to keep the team moving on long-term strategic initiatives while remaining responsive to high priority escalations
  • Has a "how can this be automated?" mindset
  • Ability to motivate, mentor and drive teams to succeed
  • Can build relationships with supporting teams, both within and outside the technology org