Job Summary
We are seeking a highly skilled HPC Network Engineer to design, deploy, and optimize high-performance computing (HPC) clusters with a focus on RoCE (RDMA over Converged Ethernet) technologies.
The ideal candidate will lead efforts in network configuration, performance tuning, security implementation, and vendor coordination to support low-latency, high-bandwidth communication across HPC environments.
Key Responsibilities
RoCE Network Design and Optimization
- Design and configure RoCE networks including switches, adapters, and Ethernet fabrics.
- Optimize network settings such as MTU, buffer sizes, and flow control parameters for peak performance.
- Implement congestion management mechanisms like Priority Flow Control (PFC) and Data Center Bridging (DCB).
- Configure RoCE-aware switches and routers for efficient RDMA traffic routing.
- Monitor and tune network performance using tools like Ethernet Performance Monitoring (EPM) and InfiniBand Performance Monitoring (IPM).
Security and Compliance
- Implement security protocols such as MACsec and IPsec to secure RDMA traffic.
- Enforce access controls and certificate-based authentication for secure endpoint communication.
Vendor Management
- Coordinate with hardware/software vendors to ensure compatibility and support.
- Define technical requirements and evaluate vendor solutions through PoCs.
- Maintain regular communication with vendors for updates, issue resolution, and performance reviews.
Collaboration and Support
- Work with cross-functional teams to support cloud migration and lifecycle management.
- Lead troubleshooting efforts and resolve complex network configuration issues.
- Support RDMA-enabled applications and parallel computing frameworks (e.g., MPI, OpenMP).
Required Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Proficiency in RoCE protocols, including RoCEv2.
- Experience designing and configuring high-performance RoCE-enabled networks.
- Strong skills in performance tuning, congestion management, and network optimization.
- Familiarity with security measures for RDMA traffic and access control mechanisms.
- Hands-on experience in deploying and managing HPC environments and data center networks.
- Proficiency in network monitoring and troubleshooting tools.
Preferred Qualifications
- Advanced degree in a technical field or equivalent practical experience.
- Certifications such as CCIE or vendor-specific RoCE certifications.
- Experience with network equipment from vendors such as Juniper, Cisco, and Arista.
- Working knowledge of firewalls (stateful/stateless) and Linux/UNIX systems.
- Scripting experience with Python or Ansible for automation.
- Familiarity with DevOps practices and CI/CD pipelines.
Certifications
- CCIE or equivalent vendor certifications (preferred).
Education: Bachelor's Degree
Certification: RoCE Certification, Cisco Certified Internetwork Expert