Staff Software Engineer (Platform Architecture & Execution Model)

Red Cell Partners

$180K - $245K
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years experience in distributed/platform systems
  • Experience in mission-critical runtimes or orchestration systems
  • Deep knowledge of durable execution principles
  • Proven track record with security and governance in production
  • Hands-on with observability tools like Grafana
  • Strong systems design across storage and event-driven architectures
  • Expertise in modern programming languages and cloud-native technologies

Responsibilities

  • Develop the core execution model for Trase OS
  • Design platform APIs and manage versioning
  • Guarantee correctness in workflow execution
  • Engineer reliability and scalability in the platform
  • Integrate security measures into the architecture
  • Deliver observability across the platform
  • Drive architectural standards and mentor other engineers

Benefits

  • Career advancement opportunities with strong performance
  • 100% employer-paid comprehensive healthcare
  • Paid maternity and paternity leave for 14 weeks
  • Unlimited PTO with management approval
  • Professional development opportunities available
  • Optional 401K, FSA, and equity incentives
  • Access to mental health benefits and GLP-1 solutions

Full Job Description
About The Role

As Staff Software Engineer, you'll own the core execution model and platform architecture of Trase OS - the shared platform ("agentic operating system") that powers all Trase deployments in regulated environments. You'll define the abstractions and APIs that connect workflows, agents, tools, and product surfaces, and ensure the correctness, scalability, and extensibility of the system.

This is a company-critical role: you are responsible for how the system behaves under real-world conditions, including failure, scale, and security constraints. Your work sets the technical direction for the platform and acts as a force multiplier across all engineering teams.

Clean abstractions and correctness-under-failure are critical because we operate long-lived agents in healthcare/defense environments where auditability and reliability are non-negotiable.

Why This Role Is Needed

Trase OS is an orchestration-heavy system coordinating long-lived workflows, agents, and tools across multiple services and environments.

As the platform evolves, the primary risks shift from implementation to system design quality:
  • Poor abstractions create tight coupling across services
  • Workflow execution becomes difficult to reason about under failure
  • Platform capabilities fragment instead of becoming reusable primitives
  • Scaling introduces complexity instead of leverage

This role exists to:
  • Define clean, durable abstractions for the platform execution model
  • Ensure correctness and determinism in workflow execution
  • Translate evolving product requirements into coherent platform architecture
  • Enable teams to build on Trase OS without introducing systemic complexity

What Makes This Role Hard

  • You are designing systems where failure is the norm, not the exception, and correctness must be preserved across retries, restarts, and partial execution
  • You must balance clean abstractions with real-world constraints (performance, security, multi-tenant environments)
  • Decisions made here become foundational primitives used across all products and teams
  • The system must remain understandable and auditable, even as complexity and scale increase

Responsibilities

  • Develop the core execution model (state machine, lifecycle, resource model, failure semantics)
  • Design platform APIs/SDKs connecting workflows, agents, tools, and product surfaces; drive versioning & compatibility
  • Guarantee correctness via idempotency, deterministic replays, compensating actions, and data integrity
  • Engineer reliability at scale: concurrency controls, rate limits, backpressure, sharding/partitioning, and workload isolation
  • Build security & governance into the core: RBAC/ABAC, policy enforcement, fine-grained audit & lineage
  • Deliver observability: distributed tracing, structured logs, metrics, and evaluation hooks; build an "explainable trail" of agent actions
  • Own quality: design reviews, test strategy (unit, property, chaos), performance baselines, SLOs, incident response, and postmortems
  • Mentor & unblock senior engineers; partner with Product, Security, and Customer teams to translate requirements into durable primitives
  • Make pragmatic choices on storage, queueing, and compute; create paved roads that accelerate all other teams
  • Define system boundaries and reduce cross-service coupling through clear architectural patterns
  • Drive platform-wide standards for correctness, reliability, and API design across teams
  • Balance short-term delivery with long-term architectural integrity, ensuring the platform evolves without accumulating systemic risk

Requirements

  • 10+ years of experience building distributed/platform systems, including significant experience defining architecture across teams or domains
  • Experience building mission-critical runtimes or workflow/orchestration systems
  • Deep expertise with durable execution (e.g., state machines, event sourcing, saga/compensation, idempotency, exactly-once/at-least-once semantics)
  • Proven track record with security & governance in production systems (auth, RBAC, audit, policy)
  • Hands-on with observability (Grafana or equivalent), including trace correlation across async boundaries
  • Strong systems design across storage, queues, schedulers, and evented architectures; performance tuning under load
  • Excellence in a modern language (e.g., Go, Rust, Java, or TypeScript) and cloud-native stacks (containers, CI/CD, IaC)
  • Comfortable operating in regulated or high-assurance environments; bias toward correctness, clarity, and documentation
  • Proven ability to influence technical direction across an organization and drive adoption of architectural standards
  • Ability to incorporate advanced LLM capabilities into system design and platform architecture decisions where appropriate

Nice to Have

  • Prior work on workflow engines (Temporal/Cadence/AWS Step Functions, Argo, Airflow) or serverless runtimes
  • Experience with policy engines (OPA), secrets/KMS, or data-handling controls (PII/PHI)
  • ML/LLM evaluation frameworks, tool/plugin architectures, or embedding model governance into execution
  • Government or healthcare experience (HIPAA, audit readiness) and multi-tenant isolation

Salary Range: $180,000-245,000. This represents the typical salary range for this position based on experience, skills, and other factors.

Our Red Cell Partners Benefits:

For full-time roles
  • Career track opportunity with potential for rapid advancement with strong performance as the firm grows.
  • 100% employer-paid, comprehensive health care, including medical, dental, and vision, for you and your family.
  • Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
  • Unlimited PTO, with management approval.
  • Opportunities for professional development and continued learning.
  • Optional 401K, FSA, and equity incentives available.
  • Mental health benefits are available through Tara Mind.
  • Cost-effective GLP-1 solutions available through Crux.