Technical Operations Lead

Karsun Solutions, LLC

$160K — $175K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years of IT work experience
  • 7+ years in technical operations/SRE for data-intensive systems, with 3+ years in AWS
  • Deep knowledge of data products: data lineage, stewardship, SLAs
  • Experience with data platforms: Databricks, Airflow, S3, Kafka/Kinesis
  • Strong understanding of SRE practices: SLI/SLO design, incident response
  • Hands-on experience with observability tools: Prometheus, Datadog
  • Familiarity with IaC (Terraform/CloudFormation) and CI/CD processes
  • Solid federal compliance experience with RBAC and encryption

Responsibilities

  • Oversee platform availability and reliability, developing operational runbooks.
  • Collaborate with the SRE Lead to implement SRE practices and transition teams from DevOps to SRE.
  • Build and maintain observability for data and AI stacks, including metrics and centralized logging.
  • Lead incident management processes, including on-call rotations and remediation tracking.
  • Automate operational workflows using IaC and CI/CD to minimize manual toil.
  • Define and enforce runbooks and disaster recovery plans for data and ML systems.
  • Coordinate with data product owners and engineers to ensure compliance and production readiness.

Benefits

  • Comprehensive health, dental, and vision insurance options
  • 401(k) retirement plan with company matching
  • Generous paid time off and holiday schedule
  • Educational assistance and professional development opportunities
  • Flexible work environment supporting work-life balance

Full Job Description
Summary

This individual will lead technical operations for a cloud-native (AWS) data and AI platform supporting a federal program and will own reliability, observability, incident response, platform engineering, and data-product operationalization.

What You'll Be Doing:
  • Serve as primary technical owner for platform availability, reliability, and operational runbook development for data pipelines, feature stores, model serving, and supporting infrastructure.
  • Work closely with the SRE Lead to design and operationalize SRE practices (SLIs/SLOs/SLAs, error budgets, toil reduction) to transition teams from DevOps to SRE.
  • In collaboration with the SRE Lead, build and maintain monitoring, alerting, and observability across data and AI stacks (ETL/ELT, data lakes/warehouses, model training and serving), including metrics, distributed tracing, and centralized logging.
  • Lead incident management: on-call rotations, incident response, RCA, remediation tracking, and continuous improvement.
  • In collaboration with the SRE Lead, automate operational workflows (deployments, scaling, recovery) using IaC (Terraform/CloudFormation) and CI/CD pipelines to reduce manual operational toil.
  • Define and enforce runbooks, backup/restore, RTO/RPO, and disaster recovery for data and ML systems.
  • Partner with data product owners, ML engineers, security, and compliance to ensure production readiness, access controls, and federal compliance requirements.
  • Manage capacity planning, cost optimization, and performance tuning of AWS resources for data and ML workloads.
  • Mentor and lead an ops/SRE team; set technical priorities and coordinate cross-functional platform changes.
  • Maintain vendor and third-party integrations and coordinate upgrades/patching under federal change-control processes.
  • Track and report reliability metrics and operational maturity improvements to stakeholders.


Required Qualifications:
  • 10+ years of directly relevant IT work experience.
  • 7+ years of technical operations, platform, or SRE experience supporting data-intensive systems; 3+ years in AWS production environments.
  • Deep understanding of data products and product ownership: data lineage, stewardship, SLAs, and consumer contracts.
  • Proven experience operating data platforms: Databricks, Airflow, S3, Kafka/Kinesis.
  • Strong SRE practice knowledge: SLI/SLO design, incident response, runbooks, chaos/failure-mode testing.
  • Hands-on experience with observability tooling (Prometheus, Datadog, OpenTelemetry) and log/tracing systems.
  • Familiar with IaC (Terraform or CloudFormation), CI/CD (GitHub Actions/Jenkins/ArgoCD), container orchestration (EKS/Kubernetes), and scripting (Python, Bash).
  • Solid security and compliance experience for federal environments (RBAC, encryption, secrets management).
  • Excellent written and verbal communication; ability to produce clear runbooks and RCA reports and to brief leadership.

Preferred Qualifications:
  • AWS Certified Solutions Architect - Associate (desirable).
  • Prior experience with ML lifecycle/MLOps tooling (SageMaker, Databricks) and feature stores.
  • Experience migrating teams from DevOps to SRE and driving organizational change.
  • Experience with cost optimization and governance of large AWS data/ML workloads.
  • Familiarity with federal program processes, change control, and procurement cycles.
  • Active federal clearance or ability to obtain one.

Things to Know:

Salary Range

The proposed salary range for this role is $160,000 to $175,000 USD. The salary range provided is a good faith estimate representative of all experience levels. Karsun considers several factors when extending an offer, including, but not limited to, the role, function, and associated responsibilities, as well as a candidate's work experience, location, education/training, and key skills.

Third Party Resumes: Karsun does not accept unsolicited resumes through or from search firms or staffing agencies. All unsolicited resumes will be considered the property of Karsun and Karsun will not be obligated to pay a placement fee.

Clearance Information

This position requires eligibility to obtain a security clearance. The Defense Industrial Security Clearance Office (DISCO), an agency of the Department of Defense, handles and adjudicates the security clearance process. More information about security clearances can be found on the U.S. Department of State website: https://www.state.gov/m/ds/clearances/c10978.htm

Location

To be considered for this role, you must reside in one of the following states: CA, CO, DC, FL, GA, IL, MD, NJ, NY, NC, OH, OK, PA, SC, TX, VA, WV.

Applicants must be authorized to work in the U.S. We may consider candidates currently in H-1B status who are eligible for transfer.