Platform Reliability Lead

Compunnel

$120K - $150K *
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science or related field
  • 5+ years in OMS Technical Operations or Platform Engineering
  • Expertise in Fluent Commerce, particularly Webhooks and Fluent GraphQL API
  • Proficient in Java for debugging and performance analysis
  • In-depth knowledge of RESTful architectures and event-driven patterns
  • Experience with monitoring tools like Datadog or Splunk
  • Strong analytical skills for system log interpretation

Responsibilities

  • Design and implement automated order remediation mechanisms
  • Build advanced telemetry dashboards to monitor performance metrics
  • Develop smart alerting systems to identify stuck orders
  • Act as the technical escalation point for incident resolution
  • Lead root cause analysis for technical incidents
  • Analyze and propose API and database performance optimizations
  • Serve as the primary technical liaison for E-commerce teams

Benefits

  • Mentorship and professional development opportunities
  • Access to cutting-edge technologies and tools
  • Collaborative work environment with cross-functional teams
  • Involvement in strategic decision-making and technical roadmaps
  • Potential for career growth within a dynamic team

Full Job Description

Job Summary

The OMS Platform Reliability Lead is a highly technical role responsible for the health, stability, and automated evolution of the Fluent Commerce Order Management ecosystem. This position leans heavily into Systems Engineering, requiring the ability to read and debug Java extensions, design complex GraphQL mutations, and build automated remediation tools for the "RUN" team. You will manage the technical RUN support team and serve as the bridge between software engineering and IT operations. Your primary focus is to transition from manual support to "Self-Healing" operations by implementing automation for order replays, data deduplication, and predictive alerting.

Key Responsibilities

Technical Automation & Self-Healing Operations

  • Order Remediation Automation: Design and implement automated "Order Replay" mechanisms within Fluent Commerce to resolve synchronization failures between event-driven integrations without manual intervention.
  • Enhanced Observability: Build advanced telemetry dashboards (using tools like Splunk, Datadog, or New Relic) to monitor GraphQL query performance, API latency, and webhook success rates.
  • Smart Alerting: Design and tune threshold-based alerting for the RUN team to identify "Stuck Orders" or inventory mismatches before they impact the customer experience.
  • Tooling Development: Script custom utilities using the Fluent Commerce SDK or REST APIs to facilitate bulk updates and system cleanups.

Technical Incident Management & Platform Monitoring

  • Deep-Dive Troubleshooting: Act as the ultimate technical escalation point for incidents requiring code-level analysis of Java custom extensions or complex GraphQL mutations.
  • Root Cause Engineering: Lead technical Root Cause Analysis (RCA) by performing deep-dives into application logs and event-driven architecture to identify architectural bottlenecks.
  • Performance Tuning: Analyze API response times and database interaction patterns to propose platform optimizations to the development team.
  • ITSM Compliance: Oversee the incident management lifecycle, ensuring documentation includes code-level workarounds and technical "bug-fixes" for future reference.

Stakeholder & Vendor Engineering Collaboration

  • Technical Liaison: Serve as the primary technical point of contact for E-commerce and architecture teams to ensure operational requirements are included in the dev roadmap.
  • Vendor Management: Collaborate with Fluent Commerce product engineers to align on platform upgrades and API versioning impacts.
  • Team Leadership: Mentor the RUN support team in technical skills including GraphQL query optimization and Java debugging.

Change Management & Release Integrity

  • Technical Oversight: Validate technical configurations and platform extensions during the release cycle to ensure deployment integrity and performance stability.
  • CI/CD Awareness: Manage version control using Git, ensuring proper branching strategies for operational hotfixes and configuration changes.

Required Qualifications

  • Education: Bachelor's degree in Computer Science, Software Engineering, or a related technical field.
  • Experience: 5+ years in OMS Technical Operations or Platform Engineering, with specific experience in high-volume, event-driven SaaS environments.
  • Fluent Commerce Expertise (preferred): Advanced technical knowledge of Fluent Commerce, specifically Webhooks, Essential Rules, and the Fluent GraphQL API.
  • Core Technical Stack:
      • Java: Proficiency in reading, debugging, and identifying performance issues in custom Java extensions.
      • GraphQL: Expert proficiency in query/mutation design, including the use of aliases, fragments, and variables for complex data manipulation.
      • Integration: Comprehensive understanding of RESTful architectures, JSON schemas, and event-driven patterns (Pub/Sub, Kafka, or Event Grid).
      • Observability: Experience with monitoring tools such as Datadog, Splunk, ELK Stack, or New Relic.
      • Git: Deep experience with repository management and deployment pipelines.
  • Process Knowledge: Strong mastery of ITIL with an SRE (Site Reliability Engineering) mindset, focusing on automation over manual "toil."
  • Analytical Skills: Ability to parse complex system logs and use data to drive proactive stability improvements.
  • Communication: Ability to explain a "race condition" or "API timeout" to a business stakeholder in terms of revenue and customer impact.
