We are seeking an experienced SRE specialist to join our SRE Practice team. As a key contributor to the reliability strategy, you will be responsible for implementing and maintaining a comprehensive reliability solution for on-premises and cloud applications and services. You will work collaboratively with various IT teams to enable and guide them in achieving site reliability.
Requirements
- Solid expertise in IT reliability.
- Extensive experience with application performance management, IT infrastructure monitoring, and user experience monitoring.
- Technical leadership experience.
- Enterprise application, systems, and network monitoring expertise for on-premises and cloud applications.
- Hands-on experience with Dynatrace, Elasticsearch, and ServiceNow in instrumenting applications end-to-end with minimal supervision.
- Solid knowledge of AIOps, anomaly detection, and event correlation solutions.
- Comfortable with scripting or programming languages (Java, C++, Go, Python).
- Experience with OpenTelemetry.
- Good knowledge of infrastructure protocols to gather element-level event data.
- Good knowledge of open-source monitoring technologies.
- Proficient with data lifecycles and aggregation, reporting, and web dashboards.
- Proficient in ITIL event management, with a good grounding in ITIL foundational concepts.
- Hands-on experience with continuous integration tools.
- Deep knowledge of reliability and Site Reliability Engineering (SRE).
- Infrastructure and Networking: The candidate should be familiar with advanced networking tools like F5, Citrix, and Cloudflare, and be able to design custom hardware and software networking solutions.
- Troubleshooting: The candidate should be proficient with advanced log analysis tools like Dynatrace and be able to develop and maintain automated testing and deployment tools.
- Cloud Computing and Virtualization: The candidate should have hands-on experience with AWS, GCP, Azure, VirtualBox, Docker, and Kubernetes, as well as advanced cloud infrastructure tools like Terraform, Puppet, or Chef.
- Distributed Systems and Scalability: The candidate should have knowledge of distributed systems platforms like Kubernetes and service meshes, and of distributed data systems like Cassandra, Hadoop, or Spark.
- Security and Compliance: The candidate should have knowledge of advanced security tools like HashiCorp Vault, AWS KMS, or Azure Key Vault, as well as security best practices, including firewalls, encryption, and SSL/TLS.
Benefits
- A financial rewards program that recognizes your success
- An industry-leading Employee Share Purchase Plan; we match 50% of net shares purchased
- An extensive flex pension and benefits package, with access to virtual healthcare
- Flexible work arrangements
- Possibility to purchase up to 5 extra days off per year
- An annual wellness account that promotes an active and healthy lifestyle
- Access to tools and resources to support physical and mental health, embracing change and connecting with colleagues
- A dynamic workplace learning ecosystem complete with learning journeys, interactive online content, and inspiring programs
- Inclusive employee-led networks to educate, inspire, amplify voices, build relationships and provide development opportunities
- Inspiring leaders and colleagues who will lift you up and help you grow
- A Community Impact program, because what you care about is a part of what makes you different. And how you contribute to your community should be just as unique.