We are seeking a seasoned Observability Lead to drive the strategy, implementation, and evolution of observability and AIOps capabilities across our enterprise IT landscape.
Key Responsibilities
- Lead Observability Strategy: Define and execute the observability roadmap aligned with business and IT goals, integrating AIOps and SRE principles.
- Tool Ownership & Integration: Manage and optimize observability tools, including OpsRamp, Splunk, AppDynamics, NetBrain, and ThousandEyes, and evaluate emerging platforms such as BigPanda and ServiceNow AIOps.
- Automation Leadership: Drive automation of L1/L2 operational tasks using Python and PowerShell, improving efficiency and reducing manual intervention.
- SRE Adoption: Collaborate with cross-functional teams to implement Site Reliability Engineering (SRE) practices, including SLIs/SLOs, error budgets, and incident response automation.
- Monitoring & Dashboarding: Design and maintain comprehensive dashboards and alerting mechanisms covering infrastructure, application, and network performance.
- Incident & Problem Management: Partner with ITSM teams to enhance incident detection, root cause analysis, and resolution workflows.
- Mentorship & Collaboration: Lead and mentor a team of observability engineers, fostering a culture of innovation, ownership, and continuous improvement.