Job Summary:
We are seeking a Contract Site Reliability Engineer to support and enhance the reliability, availability, and performance of our infrastructure. The ideal candidate will collaborate with development and operations teams to build scalable systems using modern cloud technologies while ensuring cost-efficiency. This is a hybrid role based in Plano, TX, offering exposure to large-scale systems and modern DevOps/SRE practices.
Job Responsibilities:
- Assist in designing and implementing scalable and reliable systems using Kubernetes, Docker, and Istio
- Monitor system performance and respond to incidents using observability tools like Datadog
- Proactively identify and implement performance and scalability improvements
- Create and maintain automation scripts for deployment and monitoring tasks
- Apply GitOps practices with Argo CD for reliable, smooth production deployments
- Collaborate with developers to resolve system reliability issues
- Conduct load testing to ensure stability under expected workloads
- Implement deployment strategies such as A/B testing, canary releases, and traffic mirroring
- Use Helm charts for managing application deployments
- Support and maintain AWS infrastructure, including EKS, Load Balancers, and routing
- Ensure solutions are cost-effective, highly available, and customer-focused
- Participate in on-call rotations and coordinate with global SRE teams
- Contribute to internal documentation and share knowledge across the team
- Support the adoption of SRE best practices across the organization
Required Skills:
- 2+ years of experience in Site Reliability Engineering, DevOps, or a related field
- Familiarity with Kubernetes, Docker, and Istio
- Working knowledge of AWS services and infrastructure
- Understanding of monitoring and alerting tools: Datadog, AppDynamics, ELK, Grafana, Prometheus
- Experience with tuning Horizontal Pod Autoscalers (HPAs)
- Familiarity with GitOps practices and Argo CD
- Exposure to deployment strategies: A/B testing, canary, blue/green, and traffic mirroring
- Knowledge of infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalents
- Awareness of cloud cost optimization and performance-reliability tradeoffs
- Strong troubleshooting, problem-solving, and decision-making skills
- Ability to work independently and take ownership of assigned tasks
- Organized and detail-oriented with strong documentation habits
- Excellent verbal and written communication skills
- Strong team collaboration and interpersonal skills
Preferred Skills:
- Proficiency in Golang or Rust (a plus, not required)
- Demonstrated initiative in adopting new technologies and DevOps practices
- Ability to contribute to a high-standard engineering culture
Education:
Bachelor's degree in Computer Science, Engineering, or a related field (preferred but not mandatory)