Software Developer / Engineer / Architect

Site Reliability Engineer

Salesforce is seeking a Site Reliability Engineer to join the Site Reliability organization in our Dublin location. Working closely with our counterparts in the Infrastructure and R&D organizations, this organization provides a global team of engineers who monitor cloud service availability, support platform operations and resolve incidents. The successful candidate will demonstrate a strong focus on tactical operations, as well as on large-scale production engineering and orchestration.

The Site Reliability team keeps the Salesforce cloud and our customers protected. Come help us support the services that power the biggest Enterprise Cloud Computing company in the world!


Role Responsibilities (working as part of a shift team, 4 x 10-hour shifts per week):

  • Maintain continuous systems and service availability to deliver top performance for our customers
  • Incident management: act in key support roles during major incidents
  • Participate in the technical review of each incident for problem management
  • Work to automate detection and resolution of recurring issues in the production environment
  • Operate in a high-pressure environment and troubleshoot complex issues quickly, while successfully handling multiple priorities
  • If it breaks, fix it fast, and ensure that any future occurrence is resolved automatically
  • Analyze performance-related issues and work across the Technology organization to develop solutions that ensure service resiliency
  • Automate operations by developing software applications, tooling and API integrations to connect disparate/distributed systems
  • Work with and lead other members of the team in staying current with key industry innovations and technologies, and assist in the team's development and growth
  • Participate in the on-call rotation

Requirements:

  • Experience supporting distributed systems, with knowledge of Linux/UNIX administration and internals
  • Experience with some of the following programming languages: Python, Java, C++; good knowledge of Bash scripting
  • Good knowledge of basic large-scale Internet service architectures (DNS, HTTP, load balancing, ...)
  • Experience with container-based architectures (Docker, Kubernetes, etc.)
  • Strong understanding of monitoring implementations and administration
  • Experience in a role involving hands-on, complex technical problem solving and troubleshooting
  • Past experience in incident management and a good understanding of ITIL Service Operation
  • Good verbal and written communication skills
  • Be curious and ask questions


Preferred Qualifications:

  • BS in Computer Science plus relevant job-related experience
  • Python certification
  • Red Hat Certification
  • AWS/GCP