Job Description: Site Reliability Engineer (SRE) – Grade 4
Production Support Engineering
We are looking for engineers who solve operational problems by building software. In this role, you will improve reliability, reduce toil, and enhance production systems by writing code, building automations, and leveraging modern AI-assisted development tools.
This is a hands-on engineering role—not a traditional support position. You’ll use Node.js/TypeScript, Python, and AI tools like GitHub Copilot to design and deliver solutions that make systems more reliable and operations more scalable.
What You’ll Do
- Build software solutions to improve reliability, reduce operational toil, and scale production systems, not just respond to issues.
- Develop production-quality code using Node.js / JavaScript / TypeScript and Python/PowerShell, including testing and documentation.
- Leverage modern development tooling such as VS Code and AI-assisted tools (e.g., GitHub Copilot) to accelerate delivery and problem-solving.
- Independently own well-scoped features, fixes, or improvements end-to-end—from design through deployment and operational validation.
- Participate in on-call rotations, respond to incidents, execute runbooks, and ensure clear communication and handoffs.
- Analyze incidents and recurring issues to identify patterns, reduce alert noise, and implement durable fixes.
- Implement and improve observability (logging, metrics, dashboards, alerts) for owned services.
- Build automations, scripts, and lightweight tools to eliminate repetitive manual work and improve operational efficiency.
- Identify and act on opportunities to improve system reliability, performance, and maintainability.
- Develop an understanding of how systems impact customer experience and business outcomes.
The Expertise and Skills You Bring
- Approximately 2+ years of experience in SRE, software engineering, DevOps, or production engineering.
- Strong hands-on coding skills with emphasis on:
  - Node.js / JavaScript / TypeScript
  - Python for scripting and automation
- Experience building tools, APIs, or automations to solve engineering or operational problems.
- Familiarity with AI-assisted development workflows (e.g., GitHub Copilot, code generation tools) and interest in applying AI/LLMs to improve engineering productivity.
- Foundational knowledge of:
  - Monitoring, logging, and observability concepts
  - Distributed systems and API-based architectures
  - SQL and data analysis for troubleshooting
- Exposure to cloud platforms (AWS or Azure), CI/CD pipelines, and modern development practices.
- Basic understanding of incident management, problem management, and production support processes.
What We’re Looking for in You
- A self-starter who proactively identifies key issues and trends, performs thoughtful analysis, and develops creative, high-impact solutions that deliver measurable value.
- A builder mindset: you instinctively solve operational problems by writing code and creating automation.
- A modern engineer who leverages AI tools to increase speed, quality, and effectiveness.
- Strong problem-solving skills with the ability to troubleshoot, analyze, and deliver practical solutions.
- Proactive approach to identifying risks, inefficiencies, and improvement opportunities.
- Curiosity and desire to continuously learn systems, tools, and reliability engineering practices.
- Clear communicator who collaborates effectively across engineering and operations teams.
- Growing autonomy and consistency aligned with progression toward a Grade 5 SRE role.
Certifications:
Category: Information Technology