Sr. SRE Platform Software Engineer

Bitdeer Technologies Group

$120K - $160K *

Austin, TX 78745 (In-Person)

Enterprise Technology

5 - 7 years of experience

Today


Job Overview by Ladders

Qualifications

7+ years of production software engineering experience with on-call duties.
Mastery of a systems-grade language (Go preferred, Rust, or Java) and Python for tooling.
Strong understanding of distributed systems concepts and trade-offs.
Experience with observability stacks like Prometheus and Elasticsearch at production scale.
Hands-on with GitOps and CI/CD tools such as Argo and Flux.
Proven ability in writing Kubernetes controllers for production traffic.
Experience in mTLS and secrets management using tools such as HashiCorp Vault.

Responsibilities

Lead the architecture design for a next-generation public cloud platform.
Oversee the development of various infrastructure components, including storage and networking.
Optimize deployments across multiple regions for reliability and performance.
Collaborate with cross-functional teams to define and execute cloud technology roadmaps.
Enhance observability and monitoring using metrics and logs collection services.
Implement fault prediction and remediation processes to ensure system resilience.
Manage hardware lifecycle operations and ensure data center protocols are followed.

Benefits

Work in a hands-on leadership role with a visionary approach.
Collaborate with cross-functional teams and global partners.
Engage with cutting-edge technologies in cloud infrastructure and AI.
Focus on large-scale enterprise workloads, providing meaningful impact.
Work in a dynamic environment that values innovation and strategic thinking.

Full Job Description

Position Overview

We are seeking a visionary and hands-on Cloud SRE Architect to lead the design, development, and evolution of our next-generation public cloud platform. This role will oversee the end-to-end architecture across CPU, GPU, RDS, storage, networking, serverless, and AI services, ensuring global scalability, reliability, and performance. The ideal candidate is a strategic thinker with deep technical expertise in cloud infrastructure, platform engineering, and AI systems, capable of bridging architecture vision with real-world engineering execution. You will collaborate closely with cross-functional teams and global partners to define our cloud technology roadmap, optimize multi-region deployments, and deliver world-class infrastructure and platform solutions that power large-scale AI and enterprise workloads.

Key Responsibilities
You will own one or two of the following areas:

Collection & Storage: collection-agent, customer-sdk-gateway, metrics-store, logs-store, traces-store, profiles-store, analytics-lake, enrichment-service, collection-monitor.
Alert, Correlation & SLO: alert-engine-framework, alert-correlation, slo-framework, default M-series alert rules.
Topology, Cluster-Health & Cluster Platform Services: topology-service, cluster-health-rollup, OSS-SRE-tool collection plugins for K8s, Slurm, Ray, Volcano, Kueue, and KubeRay.
Fault-Prediction: prediction-engine-framework and built-in predictors (GPU, Link, Disk, XPA, Straggler, SDC, Stranded GPU).
Remediation, Workflow, Inspection & Jobs: remediation-actuator, orchestration-substrate (workflow engine), inspection-orchestrator, job-scheduler, NCCL-baseline inspection probe.
Hardware Lifecycle & DC Ops: hardware-lifecycle, dc-operations, boot-provisioning, rolling-upgrade, bare-metal-bmc-service, auto-discovery, ZTP D0-D5 pipeline, IPMI bare-metal management.
Identity, Secrets, Tenant-Config & CMDB: iam-service, secrets-service, tenant-sre-config, cmdb-cache, schema registry.
Customer-Bridge, Ticketing & SRE Platform Portal: customer-bridge, customer-ticketing, sre-operation-system, Customer Console BFF, SRE Console BFF.
Backup, DR & Meta-Monitor: backup-orchestrator, meta-monitor, external-watcher integration (Datadog or equivalent).
CI/CD, GitOps, Plugin Framework & SRE Image Registry: cicd-pipeline, gitops-sync, plugin-registry, sre-image-registry.
Self-Improving Agent: agent-control-plane, agent-discovery, agent-codegen, agent-sandbox, per-Region LLM gateway.
Global SRE Management: maintenance-window-orchestrator, change-management, capacity-planner, cost-optimizer, gpu-efficiency-dashboard, network-stability-dashboard, patching-orchestrator, artifact-management, compat-matrix-service, security-platform.

Qualifications

Software Engineering Experience: 7+ years of production software engineering experience, including 2 or more years operating what you built (real on-call experience, not just shipping code).
Programming Languages: Production-depth mastery of at least one systems-grade language: Go (preferred), Rust, or Java. Proficiency in Python for tooling and SDK work.
Distributed Systems Fundamentals: Strong grasp of at-least-once vs. exactly-once trade-offs, idempotency, back-pressure, leader election, consistent hashing, gossip, and fan-out. Ability to evaluate CRDT vs. Raft vs. Paxos and select the right tool for the job.
Multi-Region Observability Stack: Experience at production scale with Prometheus, VictoriaMetrics, Mimir, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry. Must have built or substantively contributed to the ingest, query, or storage paths of these systems.
GitOps & CI/CD: Hands-on experience with Argo, Flux, Helm, Kustomize, Cosign signing, signed-bundle promotion, and blast-radius-aware rollouts.
Kubernetes Operator Pattern: Proven experience writing a controller or CRD handling real production traffic, with a deep understanding of watch-cache mechanics, leader election, and reconcile loops.
mTLS & Secrets Management: Experience executing end-to-end mTLS bootstrap with certificate rotation. Hands-on experience with HashiCorp Vault or cloud KMS (AWS KMS / GCP KMS).
SQL & Time-Series Data: Ability to read a Prometheus query plan, build a recording-rule strategy, and write SQL that joins per-tenant telemetry against analytics-lake tables.
Testing Discipline: Rigorous approach to unit, integration, contract, chaos, and soak testing. Experience writing and maintaining your own comprehensive tests.
Technical Writing Fluency: Ability to author clear design docs that align with existing platform architecture, create runbooks optimized for 3 AM on-call responses, and write intent-driven PR descriptions.

Preferred Qualifications (GPU / AI-Infra Context)
Experience in at least one of the following areas is a strong plus:

NVIDIA Internals: Deep understanding of DCGM and NVIDIA driver internals, including XID semantics and MIG / vGPU partitioning.
Networking & Fabrics: Experience with InfiniBand or RoCE fabrics, including subnet managers, partitioning, optical health, and NCCL collective tracing.
HPC Storage: Experience managing Lustre, NetApp, Pure, DDN, VAST, or NVMe-oF under multi-tenant loads.
Hardware Management: Hands-on experience with BMC, IPMI, and Redfish at OEM scale (Supermicro, Dell, HPE, Lenovo).
Cluster Platform Internals: Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.
BS/MS in Computer Science or similar
Hyperscale or NeoCloud experience

* Ladders Estimates
