Within CloudWatch, the CloudWatch Logs team builds and manages some of the largest logging systems in the world. Among other things, CloudWatch Logs lets AWS users ingest their application and AWS logs into a highly scalable, durable, and reliable service built for enterprise scale; find and analyze the right log data to solve operational problems; generate metrics and alarms from their log events; and send log data to other downstream systems.
The CloudWatch Logs team is growing rapidly and is looking for talented systems engineers or DevOps engineers to work as part of our Seattle-based engineering team. On a typical day, our systems engineers might dive deep to find the root cause of a customer issue, investigate why a metric is trending in the wrong direction, consult with the top engineers at Amazon, or discuss radical new approaches to automating away operational issues. You will be surrounded by people who are smart and passionate about operational excellence, availability, resiliency, and efficiency, but more importantly, about customers’ needs.
We are looking for individuals who are excited about building automation, managing fleets that self-heal, and running systems that are monitored via metrics, logs, and health checks, all while being passionate about building a discipline of operational excellence and operational visibility that scales with the needs of our customers. The ideal candidate will have experience and talent for solving complex problems of scalability and availability in massively distributed systems, working as an integral part of an engineering team as a DevOps engineer responsible for systems operations, automation, capacity management, release engineering and deployments, performance engineering, and more. This candidate recognizes and adopts best practices in documentation, testing, security, operational support at scale, and the efficient use of resources, and develops appropriate metrics to demonstrate performance or improvements.
· Bachelor’s degree in Computer Science or a related field, or relevant work experience.
· 5+ years of experience with Linux and associated tools/languages, building and running systems for high-availability Internet-facing services, with an understanding of how commodity servers, operating systems, and networks function, perform, and scale.
· 5+ years of experience developing systems management and administration automation in Perl, Python, Ruby, Bash/Shell, or Java.
· Experience building scripts, tooling, and automation that perform well and are safe for large-scale computing environments.
· Excellent troubleshooting and problem analysis skills.
· Experience in 24x7 production environments.
· Excellent communication skills and the ability to work well in a team.
· Advanced degree in Computer Science or an Engineering discipline.
· Ability to drive technical innovation in operations via automation.
· Ability to create processes that enhance operational excellence and workflow.
· Experience in performance engineering or capacity management.
· Prior experience with logging or monitoring services or infrastructure.
· Experience with massively scaled distributed systems.
· Experience with Linux performance testing, profiling, and tuning.
· Working knowledge of SQL and database administration basics.
· Knowledge of TCP/IP networking, architecture, and core technologies (such as DNS, DHCP, HTTP, routing, and VPN).
Job ID: 611237