Position Overview
Bitdeer is seeking a visionary and hands-on Cloud SRE Architect to lead the design, development, and evolution of our next-generation public cloud platform. This role will oversee the end-to-end architecture across CPU, GPU, RDS, storage, networking, serverless, and AI services, ensuring global scalability, reliability, and performance. The ideal candidate is a strategic thinker with deep technical expertise in cloud infrastructure, platform engineering, and AI systems, capable of bridging architecture vision with real-world engineering execution. You will collaborate closely with cross-functional teams and global partners to define our cloud technology roadmap, optimize multi-region deployments, and deliver world-class infrastructure and platform solutions that power large-scale AI and enterprise workloads.
Key Responsibilities
Own the end-to-end architecture of the NeoCloud SRE platform - the substrate that observes, protects, and operates a multi-region GPU rental fleet across self-built and OEM-rented data centers. You are the single point of architectural accountability across the platform's ~57 bounded contexts, ~12 frameworks, and three operational tiers (Edge DC → Regional Controller → Global Hub).
This role is for someone who writes the design, defends it under review, and shepherds it through the engineering squads that build it.
What You'll Do
- Write and maintain the platform architecture document - keep the design coherent across all sections, frameworks, and tiers. The current document is your starting point.
- Review every framework-level change - new bounded context, new plugin kind, tier-deployment shift, schema change, naming change, cross-context contract change. Architecture changes ride GitOps PRs like any other artifact.
- Set design invariants - residency rules (raw data stays in Region), Tier 2 self-sufficiency budget (≥ 24 h), survival-uplink contracts, naming conventions, SLO catalogues, redaction-at-boundary rules.
- Run the plugin framework - every extension uses one uniform contract (Common + Domain manifest, lifecycle, observability). You author and evolve this contract.
- Decide tier placement - what runs at Edge DC vs Regional Controller vs Global Hub, with data-residency / compliance / availability tradeoffs explicit.
- Coordinate with cloud-service teams and tenants - they author plugins, SDKs, dashboards, agent recipes that ride the platform. You set the contracts they consume.
- Coordinate with Security - joint ownership of vulnerability management, exposure management, joint operations. Security owns policy and risk acceptance; you own the operational mechanisms they ride.
- Pre-flight roadmap items - for any new capability, produce a one-page design that fits the existing layered model (L1-L6), tier topology, naming conventions, and extension contracts before implementation starts.
- Defend the design under review - say no to scope creep, special-case workarounds, and one-off integrations that don't fit the framework model. Say yes when a new plugin kind is genuinely needed.
Qualifications
- 10+ years of production SRE / platform-engineering / infra-architecture experience, including ≥ 3 years at architect level.
- Hands-on with GPU / AI-compute infrastructure - NVIDIA GPU ops (DCGM, MIG, vGPU, NVLink/NVSwitch, XID semantics, NCCL), InfiniBand or RoCE fabrics (subnet manager, fabric partitioning, optical health), HPC storage (Lustre, NetApp/Pure/DDN/VAST, NVMe-oF).
- Multi-region observability at scale - metrics / logs / traces / profiles / analytics-lake substrate; recording rules, MWMBR burn-rate alerting, SLI/SLO discipline.
- Cluster platforms - first-hand experience with Kubernetes (control plane + GPU Operator + topology-aware scheduling) AND at least one of Slurm / Volcano / Kueue / Ray / KubeRay.
- Data-center operations - ZTP, BMC/IPMI/Redfish, BIOS/firmware lifecycle, RMA, multi-vendor OEM management (self-built + leased DC mix).
- Strong DDD instincts - bounded contexts, public contracts, no shared databases, one-context-one-repo discipline.
- Plugin framework design - you have built (or substantively contributed to) a real extension framework with a uniform manifest + lifecycle.
- Writing fluency - you can author and maintain a multi-thousand-line architecture document under review without it drifting; you can also write a one-pager an executive will read.
- Cross-team operating tempo - design reviews, runbook authorship, on-call shadowing, post-mortem facilitation.
- Hyperscale or NeoCloud experience.
- BS/MS in Computer Science or similar.