This contract position involves optimizing an existing monitoring and alerting infrastructure using observability and AIOps tools. The role focuses on evaluating legacy tools, integrating a unified observability stack, deploying and enabling AIOps capabilities, automating incident management, and building dashboards for operational insights. The individual will also work with ServiceNow ITOM, aligning monitoring workflows with the ServiceNow CMDB and CI health status.
Requirements
- Evaluate legacy monitoring and alerting tools (e.g., BMC MainView, SolarWinds).
- Recommend and integrate a unified observability stack using Splunk, Dynatrace, Grafana, and Elastic Stack.
- Ensure end-to-end visibility across infrastructure, applications, and user experience.
- Deploy and enable AIOps capabilities (event correlation, noise reduction, predictive analytics) using Dynatrace and Splunk.
- Enable intelligent alerting and root cause analysis using ML-based models.
- Integrate ServiceNow ITOM for automated incident creation and enrichment.
- Develop automation playbooks and runbooks (Python, PowerShell, Ansible) for common incident types.
- Enable auto-remediation pipelines linked to AIOps events.
- Support auto-scaling, service restarts, and configuration drift correction.
- Deploy log, metric, and trace collection using Elastic Stack and Dynatrace.
- Define and implement Service Level Objectives (SLOs), error budgets, and MTTR/MTTD benchmarks.
- Build dashboards in Grafana, Dynatrace, and ServiceNow Performance Analytics.
- Redesign and automate event, incident, change, and problem management processes.
- Align monitoring workflows with ServiceNow CMDB and CI health status.
- Shift operations from reactive to proactive, leveraging predictive insights.
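
For candidates unfamiliar with the SLO, error budget, and MTTR/MTTD terms above, the following is a minimal Python sketch of how such figures might be computed. It is an illustration only, not the team's actual tooling: the `Incident` class, both function names, and the sample numbers are hypothetical, and it assumes a request-based SLO with MTTD and MTTR both measured from the moment an incident starts.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime
    detected: datetime
    resolved: datetime


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget used; 0.0 or below means the budget is exhausted.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 1.0 if failed == 0 else 0.0
    return 1.0 - failed / allowed_failures


def mttd_mttr_minutes(incidents: list[Incident]) -> tuple[float, float]:
    """Return (MTTD, MTTR) in minutes, both measured from incident start."""
    if not incidents:
        raise ValueError("at least one incident is required")
    n = len(incidents)
    mttd = sum((i.detected - i.started).total_seconds() for i in incidents) / n / 60
    mttr = sum((i.resolved - i.started).total_seconds() for i in incidents) / n / 60
    return mttd, mttr


if __name__ == "__main__":
    # Sample data invented for illustration.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    sample = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
                 datetime(2024, 1, 1, 10, 45)),
        Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15),
                 datetime(2024, 1, 2, 15, 15)),
    ]
    print(mttd_mttr_minutes(sample))
```

In practice these inputs would likely come from the observability and ITSM platforms named above rather than from hand-built records, and the exact definition of MTTR (repair, recovery, or resolution) would need to be agreed with the team.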