Design, implement, and improve the automation and infrastructure that underpin Sophia Space's orbital compute platform. This role focuses on configuration management, platform bootstrapping, reset workflows, and infrastructure resiliency for distributed compute systems deployed on NVIDIA Jetson hardware.
The position emphasizes practical engineering execution: building reliable, repeatable, and observable platform infrastructure that can operate autonomously in constrained environments where physical access is impossible.
Primary Responsibilities
- Design, implement, and maintain Ansible-based automation supporting platform configuration and lifecycle management.
- Improve platform bootstrapping, reset, and recovery workflows for highly available K3s-based environments.
- Develop infrastructure patterns that improve reliability, consistency, and operational predictability across deployments.
- Identify and reduce configuration drift, operational complexity, and failure-prone infrastructure behaviors.
- Improve observability and diagnosability of runtime platform services and infrastructure components.
- Support development of self-healing, declarative, and resilient infrastructure capabilities.
- Collaborate with platform, systems, and software engineers to ensure infrastructure aligns with product and operational requirements.
- Develop and maintain technical documentation, operational runbooks, and infrastructure standards.
- Support infrastructure validation, troubleshooting, and root-cause analysis across distributed systems.
Required Skills
- 3+ years of experience in DevOps, Platform Engineering, Systems Engineering, Infrastructure Engineering, or similar technical roles.
- Hands-on experience with Ansible or comparable infrastructure automation tools.
- Strong Linux systems administration and troubleshooting skills.
- Experience operating Kubernetes or lightweight Kubernetes platforms such as K3s, RKE2, or MicroK8s.
- Experience building repeatable, idempotent infrastructure automation.
- Ability to troubleshoot distributed systems spanning networking, services, and configuration layers.
- Experience supporting bare-metal, edge, appliance-based, or hardware-backed systems.
- Strong written and verbal communication skills.
- Ability to work independently in a fast-paced startup environment.
Desired Skills
- Experience supporting highly available Kubernetes environments.
- Experience designing infrastructure reset, recovery, backup, or disaster-recovery workflows.
- Familiarity with reconciliation, configuration drift detection, or self-healing infrastructure approaches.
- Experience supporting AI, ML, GPU, or distributed compute environments.
- Understanding of Kubernetes networking concepts, including DNS, ingress, service discovery, load balancing, and firewalling.
- Experience with storage systems, persistent volumes, backup/restore mechanisms, or distributed data movement.
- Experience scaling infrastructure in startup or rapidly growing engineering organizations.
Success Outcomes
- Runtime infrastructure remains consistently configurable, maintainable, and reliable across development, test, and orbital environments.
- Platform bootstrapping and reset workflows become standardized, repeatable, and operationally predictable.
- Infrastructure failure modes become easier to identify, diagnose, and recover from.
- New platform capabilities integrate cleanly without degrading reliability or maintainability.
- Platform services remain supportable by engineers beyond their original authors, enabling sustainable team growth.
- Customer workloads operate reliably without requiring continuous infrastructure intervention.
The pay range for this role is: 105,000 - 158,000 USD per year (Pasadena, CA)