ServiceNow is seeking a Senior Staff Machine Learning Engineer - Site Reliability Engineer to contribute to the design, development, and implementation of AI-powered infrastructure and observability features. The role involves collaborating with researchers and engineers to keep GPU clusters running efficiently, building reusable code, and ensuring high-quality product delivery. ServiceNow aims to make the world work better by connecting people, systems, and processes.
Requirements
- Experience in leveraging AI in work processes.
- Proficiency with LLM-based features and LLM training.
- Experience using AI productivity tools such as Cursor and Windsurf.
- 8+ years of experience with infrastructure and platform operations, deployments, SRE, and DevOps.
- 6+ years of experience operating highly available distributed workloads on Kubernetes.
- 6+ years of development experience with Python, Go, Java, or similar languages.
- Experience with DevOps tooling (e.g., Helm, Ansible, Kubernetes, Prometheus, Splunk, GitLab CI).
- Strong working experience operating distributed systems built on Linux and J2EE.
- Experience with software-defined networking, infrastructure as code, and configuration management.
- Experience with building software for compliance and security in regulated environments.
Benefits
- Competitive salary.
- Opportunities for professional development.