Sr. SRE Platform Architect

Bitdeer Technologies Group

$130K — $180K *
Enterprise Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years of production SRE, platform engineering, or infra architecture, including 3 years at an architect level.
  • Hands-on experience with GPU/AI-compute infrastructure such as NVIDIA GPU operations and HPC storage.
  • Expertise in multi-region observability at scale including metrics, logs, and analytics.
  • First-hand experience with Kubernetes and at least one other cluster management tool.
  • Knowledge of data-center operations including vendor management and lifecycle controls.
  • Strong instincts in domain-driven design (DDD) including bounded contexts and contract definitions.
  • Proficient in plugin framework design with a track record of building extension frameworks.
  • Exceptional writing skills for technical documentation and executive communication.

Responsibilities

  • Own the architecture for the NeoCloud SRE platform, overseeing all tiers and frameworks.
  • Write and maintain the platform architecture document, ensuring coherence across components.
  • Review framework-level changes for architectural consistency and quality.
  • Establish design invariants and operational standards for the platform.
  • Run and evolve the plugin framework, maintaining uniform contracts for extensions.
  • Decide data tier placements considering residency and compliance requirements.
  • Coordinate with cloud-service teams to define contracts and integration points.
  • Collaborate with security teams on vulnerability and exposure management.
  • Produce design documents for new capabilities that adhere to existing models.
  • Defend architectural integrity against scope creep and ensure compliance with framework standards.

Benefits

  • Opportunity to lead the architectural design of a next-generation cloud platform.
  • Engagement with cutting-edge AI infrastructure and large-scale deployments.
  • Collaboration with global teams and cross-functional partners.
  • Hands-on involvement in decision-making for innovative technologies.
  • A dynamic work environment that encourages strategic thinking and problem-solving.
Full Job Description
Position Overview

Bitdeer is seeking a visionary and hands-on Cloud SRE Architect to lead the design, development, and evolution of our next-generation public cloud platform. This role will oversee the end-to-end architecture across CPU, GPU, RDS, storage, networking, serverless, and AI services, ensuring global scalability, reliability, and performance. The ideal candidate is a strategic thinker with deep technical expertise in cloud infrastructure, platform engineering, and AI systems, capable of bridging architecture vision with real-world engineering execution. You will collaborate closely with cross-functional teams and global partners to define our cloud technology roadmap, optimize multi-region deployments, and deliver world-class infrastructure and platform solutions that power large-scale AI and enterprise workloads.

Key Responsibilities

Own the end-to-end architecture of the NeoCloud SRE platform - the substrate that observes, protects, and operates a multi-region GPU rental fleet across self-built and OEM-rented data centers. You are the single point of architectural accountability across the platform's ~57 bounded contexts, ~12 frameworks, and three operational tiers (Edge DC → Regional Controller → Global Hub).

This role is for someone who writes the design, defends it under review, and shepherds it through the engineering squads that build it.

What You'll Do

  1. Write and maintain the platform architecture document - keep the design coherent across all sections, frameworks, and tiers. The current document is your starting point.
  2. Review every framework-level change - new bounded context, new plugin kind, tier-deployment shift, schema change, naming change, cross-context contract change. Architecture changes ride GitOps PRs like any other artifact.
  3. Set design invariants - residency rules (raw data stays in Region), Tier 2 self-sufficiency budget (≥ 24 h), survival-uplink contracts, naming conventions, SLO catalogues, redaction-at-boundary rules.
  4. Run the plugin framework - every extension uses one uniform contract (Common + Domain manifest, lifecycle, observability). You author and evolve this contract.
  5. Decide tier placement - what runs at Edge DC vs Regional Controller vs Global Hub, with data-residency / compliance / availability tradeoffs explicit.
  6. Coordinate with cloud-service teams and tenants - they author plugins, SDKs, dashboards, agent recipes that ride the platform. You set the contracts they consume.
  7. Coordinate with Security - joint ownership of vulnerability management, exposure management, joint operations. Security owns policy and risk acceptance; you own the operational mechanisms they ride.
  8. Pre-flight roadmap items - for any new capability, produce a one-page design that fits the existing layered model (L1-L6), tier topology, naming conventions, and extension contracts before implementation starts.
  9. Defend the design under review - say no to scope creep, special-case workarounds, and one-off integrations that don't fit the framework model. Say yes when a new plugin kind is genuinely needed.

Qualifications

  • 10+ years of production SRE / platform-engineering / infra-architecture, including ≥ 3 years at architect level.
  • Hands-on with GPU / AI-compute infrastructure - NVIDIA GPU ops (DCGM, MIG, vGPU, NVLink/NVSwitch, XID semantics, NCCL), InfiniBand or RoCE fabrics (subnet manager, fabric partitioning, optical health), HPC storage (Lustre, NetApp/Pure/DDN/VAST, NVMe-oF).
  • Multi-region observability at scale - metrics / logs / traces / profiles / analytics-lake substrate; recording rules, multi-window, multi-burn-rate (MWMBR) alerting, SLI/SLO discipline.
  • Cluster platforms - first-hand experience with Kubernetes (control plane + GPU Operator + topology-aware scheduling) AND at least one of Slurm / Volcano / Kueue / Ray / KubeRay.
  • Data-center operations - ZTP, BMC/IPMI/Redfish, BIOS/firmware lifecycle, RMA, multi-vendor OEM management (self-built + leased DC mix).
  • Strong DDD instincts - bounded contexts, public contracts, no shared databases, one-context-one-repo discipline.
  • Plugin framework design - you have built (or substantively contributed to) a real extension framework with a uniform manifest + lifecycle.
  • Writing fluency - you can author and maintain a multi-thousand-line architecture document under review without it drifting; you can also write a one-pager an executive will read.
  • Cross-team operating tempo - design reviews, runbook authorship, on-call shadowing, post-mortem facilitation.
  • Hyperscale or NeoCloud experience.
  • BS/MS in Computer Science or similar.
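
For readers unfamiliar with the MWMBR burn-rate alerting named in the observability bullet above, the following is a minimal sketch of the general technique, not Bitdeer's implementation. It assumes a 30-day SLO window and uses commonly cited illustrative thresholds; the names `BurnRateRule`, `RULES`, `burn_rate`, and `evaluate` are hypothetical. An alert fires only when both the long and the short window exceed the same burn-rate threshold, so a sustained burn is required to trigger it and the alert clears soon after the problem stops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that proves the burn is sustained
    short_window: str   # window that proves the burn is still happening
    threshold: float    # burn-rate multiple at which the rule fires
    severity: str


# Illustrative thresholds (an assumption) for a 30-day SLO window:
# 14.4x over 1h spends 2% of the error budget, 6x over 6h spends 5%,
# and 1x over 3d spends 10%.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def evaluate(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the rules that fire, given the observed error ratio per window."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        # Both windows must exceed the threshold for the rule to fire.
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(f"{rule.severity}:{rule.long_window}/{rule.short_window}")
    return fired
```

In practice the per-window error ratios would come from recording rules in the metrics system; here they are passed in as a plain dictionary keyed by window name.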

