Job Description
We are seeking a Software Engineer III to join our GRIDScaler team. The ideal candidate will be a highly experienced engineer with deep expertise in distributed systems, storage, observability, and Linux internals. This role demands strong hands-on development skills in Go, Rust, and Python, coupled with the ability to design system architecture, lead technical initiatives, and mentor engineers. The candidate will also drive efforts in telemetry, monitoring, APIs, and developer tooling, ensuring the GRIDScaler product continues to scale for enterprise and HPC environments.
Responsibilities
- Architect, design, and review complex software systems and frameworks for scalability, performance, and reliability.
- Provide technical leadership and mentorship to engineers across multiple teams and geographies.
- Define and implement long-term product and system architecture strategies, including telemetry, observability, and monitoring solutions.
- Develop and guide implementation of CLI tools, REST APIs, and automation frameworks for internal and external use.
- Own the design and integration of telemetry and monitoring pipelines (e.g., OpenTelemetry, Prometheus, Grafana, custom agents).
- Collaborate with product management and engineering managers to translate requirements into technical roadmaps and architecture designs.
- Oversee and review code quality, testing practices, and CI/CD pipelines (Jenkins or equivalent).
- Lead complex debugging and performance tuning efforts in distributed and storage systems.
- Represent the team in cross-functional technical discussions with customers, partners, and executive stakeholders.
- Contribute to and validate technical documentation and architectural design documents.
Qualifications
- BS/MS/PhD in Computer Science, Computer Engineering, or a related field, with 10+ years of software engineering experience, including at least 3 years in a technical leadership or architect role.
- Proven expertise in Linux system programming and strong understanding of Linux internals.
- Deep hands-on programming experience with Go, Rust, and Python.
- Strong background in distributed systems, storage systems, HPC, or parallel file systems (e.g., IBM Spectrum Scale (GPFS), Lustre, Ceph).
- Experience designing telemetry, observability, and monitoring frameworks for large-scale distributed products.
- Expertise in building scalable APIs and developer-facing CLI tools.
- Familiarity with cloud-native architectures (Kubernetes, containers, microservices) and deployment on AWS, Azure, or hybrid cloud environments is highly desirable.
- Solid knowledge of CI/CD systems (e.g., Jenkins, GitLab CI) and modern DevOps workflows.
- Strong leadership skills: ability to mentor, influence technical direction, and make architecture decisions.
- Exceptional debugging, performance analysis, and problem-solving skills in complex distributed environments.
- Excellent written and verbal communication skills with the ability to articulate technical vision to engineers and leadership.
DDN
Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, in both their work ethic and their deliverables, which makes strong prioritization skills essential. We also value strong communication skills in all our engineers and researchers, as they are crucial to the success of our teams and the company as a whole.
Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:
- Coding assessment: Often in a language of your choice.
- Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
- Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
- Meet and greet with the wider team.
Our goal is to finish the main process within 2-3 weeks.
#LI-Remote