This position requires office presence of a minimum of 5 days per week and is only located in the location(s) posted. No relocation is offered.
What you’ll do:
In this role, you will focus on understanding why production incidents happen and how to prevent them from recurring. You will analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses.
You will turn incident insights into high-quality postmortems and partner with engineering teams to drive corrective actions and long-term improvements. By combining system-level thinking with data, automation, and AI-assisted analysis, you will help shift the organization from reactive response to proactive reliability and incident prevention. You will partner with engineering and software development teams to implement permanent fixes and preventive improvements.
What you'll bring:
- Proven experience performing deep RCA for production incidents
- Strong understanding of end-to-end system architecture (cloud, web apps, APIs, databases, infrastructure)
- Hands-on experience with observability tools (logs, metrics, traces)
- Ability to identify patterns and drive preventive actions
- Experience writing clear, structured postmortems
- Ability to analyze operational data using tools, queries, or AI-assisted methods
- Strong systems thinking and problem-solving skills
- Background in QA, test engineering, or automation engineering (strong plus)
- Experience using AI or advanced analytics for incident analysis or pattern detection
- Understanding of distributed systems and failure modes
- Experience with data analysis / visualization tools (e.g., Power BI, Tableau)
- Mindset focused on eliminating recurring issues, not just fixing incidents
- Strong communication skills to explain complex issues clearly
Required:
- 7+ years in Systems Engineering, ITSM, RM/CM
- Background in SRE, Support or QA
- One or more of the following SRE Tools: T-APM, T-Trace, CatchPoint, Grafana
- Hands-on experience with, and understanding of, concepts and tools such as SAFe, Agile, DevOps, CI/CD, Data Analytics, and building Gen AI use cases
- Experience with AI technologies, Python, SQL, data analytics, Power BI and ITSM tools (e.g., ServiceNow)
- Experience with modern enterprise Release Management/Change Management and ITSM
Preferred:
- BS/BA in Computer Science
- Experience with modern Release Management processes for Agile and DevOps environments
- Preferred tools: Jira Align, JSM, Jira Cloud, Git for enterprise RM/CM
- Relevant certifications (SAFe, Agile, DevOps, AI/ML)
#LI-Onsite - Full-time office role
Ready to join our team? Apply today.
Our Principal System Engineering jobs earn between $155,400.00 - $261,100.00 USD annually. Not to mention all the other amazing rewards that working at AT&T offers. Individual starting salary within this range may depend on geography, experience, expertise, and education/training.
Joining our team comes with amazing perks and benefits:
- Medical/Dental/Vision coverage
- 401(k) plan
- Tuition reimbursement program
- Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
- Paid Parental Leave
- Paid Caregiver Leave
- Additional sick leave beyond what state and local law require may be available but is unprotected
- Adoption Reimbursement
- Disability Benefits (short term and long term)
- Life and Accidental Death Insurance
- Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
- Employee Assistance Programs (EAP)
- Extensive employee wellness programs
- Employee discounts up to 50% off on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available), and AT&T phone
Weekly Hours:
40
Time Type:
Regular
Location:
Dallas, Texas; Middletown, New Jersey; Plano, Texas; Atlanta, Georgia (1057 Lenox Park Blvd NE)
Salary Range:
$155,400.00 - $261,100.00