Director, Digital Reliability Engineering at Royal Caribbean Group. The role leads the global Technology Operations portfolio for the Digital organization and ensures the reliability, availability, and performance of guest-facing pre-cruise platforms.
Requirements
- Define and execute the global SRE strategy for Digital Operations, aligning with business priorities and Royal Caribbean’s long-term technology vision.
- Build and nurture a culture of reliability, resilience, and continuous improvement across all digital platforms.
- Drive initiatives to maintain zero downtime by rapidly addressing issues, conducting root cause analysis, and implementing remediations.
- Lead global site reliability and operations teams across onshore, nearshore, and offshore locations while actively engaging in day-to-day challenges.
- Actively participate in major incident response, including log analysis, recovery validation, and executive updates.
- Lead problem bridges, collaborating across technical and functional teams for timely issue resolution.
- Partner with engineers to diagnose, troubleshoot, and resolve critical issues in real time, demonstrating technical credibility.
- Strengthen ITSM processes (Incident, Problem, Change, Major Incident) using tools like ServiceNow, PagerDuty, and JIRA.
- Lead engineering support for production issue remediation, ensuring timely root-cause analysis, resolution, and prevention of recurring problems.
- Manage and prioritize ongoing maintenance activities, patches, upgrades, and operational improvements across the digital technology stack.
- Establish strong feedback loops with product and engineering teams so that recurring issues and operational pain points are systematically eliminated.
- Work directly with teams to ensure the reliability of a hybrid technology stack spanning Mobile, Web, Backend Services, Commerce, and Cloud Infrastructure.
- Champion observability and performance practices using platforms such as Splunk, Dynatrace, Prometheus, and Quantum Metric or other real user monitoring (RUM) tools.
- Promote automation, chaos engineering, and AI-driven anomaly detection to strengthen system resilience.
- Guide teams in Infrastructure as Code and modern operational tooling.
- Oversee all environment activities, including new code deployments.
- Recruit, mentor, and develop global SRE talent while modeling hands-on technical engagement.
- Manage vendor and partner teams with the same “roll-up-your-sleeves” approach as internal teams.
- Deliver executive-ready dashboards and insights to communicate the health of digital operations.
- Own and manage the Operational Expenditure (OPEX) budget for Digital Operations, ensuring efficient allocation of resources while balancing reliability, scalability, and cost optimization.
- Provide transparency into operational spend through regular reporting and executive updates.
- Partner with Finance and Procurement to negotiate, track, and optimize vendor contracts and third-party services.
- Ensure budget discipline while identifying opportunities for automation and efficiency improvements to reduce operational costs without compromising reliability.
Benefits
- Competitive compensation and benefits package
- Excellent career development opportunities
- Global experience