Software Developer/ Engineer/ Architect

Director, Site Reliability Operations

We are looking for a Director to lead the Site Reliability Operations team for one of the largest and most trusted cloud platforms in the world!

You will lead a team in the Dublin region, supporting our CRM platform. In this position, you will have an opportunity within the SRO leadership team to strategize on our current transformation journey toward an ever-improving, true SRO, and your ideas may help shape the direction of the SRO organization.

Our team provides the Incident Management, Detection, Operational and Software Engineering expertise, as well as any Root Cause Analysis/remediation and other proactive measures to improve the stability of customer performance and minimize TTR. We will draw on your past experience as a people and technical leader leading software engineering efforts, keeping your team balanced and productive.

The successful candidate will lead a team of Site Reliability Operations Engineers, setting a vision for growth and driving operational transformation while automating how we do Operations, with a constant goal of reducing toil and fragility. As such, you have a balance of technical expertise, leadership skills and managerial experience. Your operational skills are sufficiently advanced to enable you to set technical direction on incident bridges and marshal resources accordingly, and to ensure that investigations follow the appropriate troubleshooting paths and that monitoring, triage and change execution remain optimal. As a leader in this role, you demonstrate a strong focus on engineering practices, service ownership, agile leadership and people management skills. Your scope will span the full breadth of our Private Cloud and Hyperforce public cloud infrastructure.

Key Responsibilities

You will supervise the day-to-day responsibilities of front-line Site Reliability Engineers. The ideal candidate combines software engineering management experience in an agile development process with excellent multi-functional communication and organization skills and experience handling enterprise-scale Internet services. As a technical leader, you will both build the strategy for your team’s role in a larger movement to DevOps principles within Salesforce and set the tactical direction across multiple teams as you drive incident investigations. This position will involve encouraging and maintaining positive relationships with other connected areas of the business, ensuring the SRO team are vital partners within a continuous cycle of engineering and process improvements.

You will be responsible for:

  • Incident Management - Act in key support roles during major incidents e.g. Sev0, Sev1, Sev2.
  • Resiliency Engineering - Drive the team as well as partner product teams to populate and participate in RCAs to drive permanent resolution of sophisticated issues. Collaborate to be proactive in design, management, and improvement of high-quality customer-facing services, with a focus on automation, reliability, and observability.
  • Strategic Planning - Collaborate successfully with both internal and external partners to carry out the strategy for SRO to meet the vision of the SRE Transformation.
  • Process-Minded - Create and improve processes that help SREs respond to and mitigate incidents against quantitative goals.
  • Collaborative and Influential - Work successfully with other cross-cloud service owners (Developers, DBAs, Network, etc.), maintaining positive relationships while exerting influence.
  • Data-Driven - We want a leader who will use data to solve underlying problems in our systems.

Qualifications

  • 10+ years of Infrastructure Engineering or Technical Operations experience
  • 5+ years leading Site Reliability Engineering, Operations, or Software Development teams, preferably in globally distributed environments
  • Experience with management and troubleshooting of Internet services running in traditional data centers and on Public Cloud (AWS, GCP, or Azure) infrastructure
  • Past experience in Incident Management and a strong understanding of ITIL processes and Scrum agile development methodologies
  • Expertise with enterprise observability and monitoring systems, such as Prometheus, OpenTSDB, and Splunk
  • Experience in leading and driving team Transformations that showcase Teamwork and Collaboration, Adaptability, Customer Focus, Results, and Innovation
  • Entrepreneurial-spirited with strong Aloha spirit, with experience successfully coaching individuals to achieve goals and focus on employee development
  • Experience in delivering Engineering Productivity, working in a Service Ownership model and a proven track record of Customer Success
  • Strong communication, organizational, analytical and problem-solving skills and attention to detail