Platform Operations and Site Reliability Lead

eTelligent Group LLC

$120K - $150K
Enterprise Technology
8 - 10 years of experience

Job Overview by Ladders

Qualifications

  • 8+ years managing enterprise production environments.
  • 5+ years supporting AWS cloud operations.
  • Experience with Databricks or enterprise data environments.
  • Background in implementing observability and Site Reliability Engineering practices.
  • U.S. citizenship required; must be eligible for IRS MBI clearance, with an active clearance preferred.

Responsibilities

  • Lead operations and maintenance for AWS cloud and Databricks services.
  • Manage observability frameworks, including monitoring and alerting.
  • Implement Site Reliability Engineering best practices.
  • Coordinate incident response and service restoration.
  • Develop operational runbooks and automated procedures.
  • Oversee disaster recovery and continuity planning.
  • Support AI-driven operational intelligence and predictive monitoring capabilities.

Benefits

  • Remote work options, with IRS facilities in several U.S. locations.
  • Opportunity to work with cutting-edge technologies in AI and cloud.
  • Engagement with diverse teams and projects in a federal environment.
  • Possibility for career growth in a high-demand field.

Full Job Description

Place of Performance: Remote and/or IRS facilities in Lanham, MD; Martinsburg, WV; Memphis, TN; Washington, D.C.; Austin, TX; Dallas, TX.

Citizenship: U.S. citizenship required.

Security Clearance: Must be eligible to hold an IRS MBI (background investigation) clearance. An active IRS MBI clearance is preferred.

Role Overview:

The Platform Operations and Site Reliability Lead is responsible for ensuring the reliability, availability, performance, scalability, and operational excellence of the Enterprise Data Platform. The Lead oversees 24x7 platform operations, observability, incident response, disaster recovery, performance optimization, and AI-enabled operational automation across AWS and Databricks environments.

Key Responsibilities:
  • Lead operations and maintenance activities supporting AWS cloud infrastructure and Databricks E2 services.
  • Manage observability frameworks including monitoring, logging, tracing, and alerting.
  • Implement Site Reliability Engineering practices including SLIs, SLOs, error budgets, and reliability metrics.
  • Coordinate incident response, root cause analysis, and service restoration activities.
  • Develop operational runbooks, playbooks, and automated remediation procedures.
  • Lead disaster recovery planning, testing, backup validation, and continuity activities.
  • Support AI-driven operational intelligence and predictive monitoring capabilities.
  • Track and report service levels, uptime metrics, and operational performance indicators.


Minimum Qualifications:
  • Minimum 8 years managing enterprise production environments.
  • Minimum 5 years supporting AWS cloud operations.
  • Experience supporting Databricks, analytics platforms, or enterprise data environments.
  • Experience implementing enterprise monitoring, observability, and Site Reliability Engineering practices.


Preferred Certifications:
  • AWS Certified SysOps Administrator
  • AWS Solutions Architect Associate
  • Databricks Platform Administrator

