Full Job Description
We're hiring a Lead DevOps Engineer to raise the standard of how we build, test, deploy, and operate software. This is a hands-on role with strong technical ownership and a developer enablement mindset: you'll reduce deployment friction, improve environment reliability and quality, strengthen observability, and lead incident resolution through to completion.
You'll lead through standards and influence: building reusable automation, mentoring others, and driving improvements across teams (including coordination with EMEA DevOps counterparts).
What you'll do
• Lead automation initiatives that eliminate repetitive tasks and reduce operational toil.
• Build and maintain Ansible automation to provision new environments and keep existing environments up to date.
• Propose and lead platform improvement projects using tools such as Ansible, Rundeck, and CI/CD systems.
• Design and improve CI/CD pipelines and deployment automation with safe rollout/rollback strategies and clear environment promotion.
• Enable developers through reusable "paved road" tooling: templates, golden pipelines, self-service workflows, and guardrails that reduce manual work and tribal knowledge.
• Partner with engineering teams to improve delivery quality through:
    • automated integration and regression testing,
    • deployment validation and smoke testing,
    • reliable and repeatable test/pre-production environments,
    • quality gates that catch issues earlier.
• Improve observability across services and infrastructure (monitoring, logging, alerting, tracing), including visibility into deployment outcomes and failures.
• Lead analysis and resolution of production incidents across infrastructure, application, database, and network layers; drive RCAs and prevention work.
• Oversee platform patching and upgrades; plan, schedule, and monitor maintenance tasks.
• Coordinate and implement server/platform changes required by customers and internal teams.
• Document systems and processes, transfer knowledge, and mentor engineers to raise technical standards across the organization.
• Communicate proactively with stakeholders, manage multiple requests, and prioritize work effectively.
What success looks like
• Manual operational work is automated or removed; fewer repetitive tasks and fewer "only one person knows" processes.
• Faster, safer releases with stronger validation, clearer rollback paths, and improved release confidence.
• More reliable environments and improved readiness for new customer onboarding.
• Better visibility into platform health and incidents: higher signal, less alert noise, faster diagnosis and recovery.
• Clear standards and reusable tooling adopted across teams; improved developer experience and reduced deployment friction.
Required qualifications
• Bachelor's degree in Computer Science (or a comparable degree), or equivalent professional experience.
• 6+ years of Red Hat Linux server administration experience, including production troubleshooting and log triage.
• 6+ years of Ansible scripting and automation experience (or equivalent experience with another configuration management tool).
• 6+ years of scripting experience in Bash and Python.
• Experience building and operating CI/CD pipelines and deployment automation.
• Strong troubleshooting skills across distributed systems (infrastructure, application, database, and network layers).
• Strong working knowledge of MySQL (SQL query proficiency required); PostgreSQL experience is a plus.
• Excellent communication skills and the ability to lead through planning, prioritization, and influence.
• Proven ability to context-switch, manage multiple stakeholder requests, and deliver reliably under deadlines.
Preferred qualifications
• Container infrastructure design and implementation; experience with Docker; Kubernetes/Helm a plus.
• Experience with Rundeck, Jenkins, Solr, and/or etcd.
• Experience with monitoring/logging/alerting and modern observability practices (SLOs/SLIs, change correlation, incident reduction).
• Networking fundamentals (DNS, routing, connectivity troubleshooting).
• Familiarity with standard change management practices (e.g., ITIL).
• General knowledge of programming concepts and code structure; Java familiarity is a plus.
• Experience with progressive delivery (canary/blue-green), feature flags, IaC (Terraform/CloudFormation), and secrets management (Vault or equivalent).
• Interest in extending observability to automated workflows and AI/agent activity (execution tracing, failures, permissions, cost visibility).
Working style
This role requires strong ownership, organization, attention to detail, proactive stakeholder communication, and a bias toward automation and repeatability. You'll be expected to take ambiguous, high-impact problems through to resolution and leave the platform better than you found it.
The Wise Lead System Engineer functions as an embedded subject matter expert and technical project leader within the OCLC Wise development team. OCLC follows a hybrid work model: at least 3 days a week in the office and up to 2 days remote.