Principal Kafka Support & Reliability Engineer

Purple Drive Technologies

$130K — $160K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 5-10 years of experience in Kafka support or reliability engineering.
  • Proven expertise in incident management and troubleshooting for Kafka environments.
  • Strong understanding of Kafka performance optimization and scalability risks.
  • Experience with platform stability, monitoring, and alerting enhancements.
  • Familiarity with AWS MSK and Confluent Cloud platforms.
  • Prior engagement in root cause analysis (RCA) with CAPA documentation.

Responsibilities

  • Act as the highest escalation point for critical Kafka production incidents.
  • Perform advanced troubleshooting for broker and partition issues.
  • Analyze Kafka workloads for performance and scalability risks.
  • Diagnose and resolve Kafka stability and resilience issues.
  • Lead configuration management and changes for Kafka environments.
  • Own and document root cause analysis for major incidents.
  • Collaborate closely with various teams for continuous improvement efforts.

Benefits

  • Opportunities for professional growth in a senior technical role.
  • Work in a collaborative environment with cross-functional teams.
  • An environment that prioritizes innovation and continuous improvement.

Full Job Description
Overview:

Role: Principal Kafka Support & Reliability Engineer

Location: Canton, MA

Role Descriptions:

1. Tier 3 Incident Management & Escalation Support

  • Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).
  • Lead deep troubleshooting across:
      • Broker instability, controller elections, ISR shrinkage
      • Under-replicated partitions and leader imbalance
      • Producer/consumer failures, lag spikes, and rebalance storms
      • Disk, network, JVM, and request handler saturation
  • Provide hands-on remediation for complex issues, including:
      • Partition reassignment and leader rebalance
      • Broker configuration tuning
      • Throttle/quota strategies for noisy producers or consumers
  • Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
  • Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization

  • Analyze Kafka workloads for performance and scalability risks:
      • Partition skew and hot partitions
      • Inefficient producer batching/compression
      • Consumer lag root cause analysis
      • Thread pool, I/O, and network bottlenecks
  • Recommend and validate:
      • Topic design (partition count, replication factor, retention, compaction)
      • Producer and consumer configuration best practices
      • Quotas, quota enforcement, and multi-tenant controls
  • Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience

  • Diagnose and resolve systemic Kafka stability issues:
      • Repeated broker failures or flapping
      • Metadata/controller instability (ZooKeeper or KRaft)
      • Recovery issues following failovers or maintenance events
  • Support resilience initiatives:
      • Multi-AZ cluster health validation
      • Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
      • Failover testing and validation
  • Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership

  • Lead medium- to high-risk Kafka changes, including:
      • Broker and cluster configuration changes
      • Partition expansion or large-scale reassignment
      • Topic policy changes impacting durability or performance
  • Support and plan Kafka version upgrades:
      • MSK / Confluent upgrade cycles
      • Client compatibility and rollout strategies
  • Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement

  • Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
  • Identify recurring failure patterns and architectural gaps.
  • Recommend platform-level improvements:
      • Automation opportunities
      • Guardrails and standards
      • Monitoring and alerting enhancements
  • Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.

Essential Skills: Role Overview

The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments. This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK / Confluent Cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.

Desirable Skills:

Keyword:

Skills: Digital: Kafka, Digital: Amazon Connect, Digital: Kubernetes

Experience Required: 10 & Above

More Jobs at Purple Drive Technologies

  • Senior Robotics Engineers
    $120K — $150K *
    Cupertino, CA 95014 (Santa Clara County)
    Manufacturing & Automotive
    In-Person
  • Salesforce Architect
    $130K — $160K *
    Irvine, CA 92620 (Orange County)
    Enterprise Technology
    In-Person
  • Salesforce Architect
    $130K — $160K *
    Irvine, CA 92620 (Orange County)
    Enterprise Technology
    In-Person
  • Data Modeler
    $100K — $130K *
    Los Angeles, CA 90011 (Los Angeles County)
    Finance & Insurance
    In-Person
  • ServiceNow Architect
    $120K — $150K *
    Malvern, PA 19355 (Chester County)
    Enterprise Technology
    In-Person
