Senior Site Reliability Engineer

Salesforce Ireland
Ireland - Dublin, United Kingdom - London
September 29, 2021

Salesforce is seeking an engineering candidate to join the Industries Site Reliability organization. This organization provides a distributed team of engineers who monitor cloud service availability and stand ready to swiftly repair any service-impacting issues. Seven days a week, 24 hours a day, in a follow-the-sun model, the Site Reliability team keeps the Salesforce Industries Cloud services and our customers protected. As a member of the SRE team, you will be tasked with detecting and resolving incidents within minutes. This objective is met by monitoring the services, reacting to problems, and proactively addressing issues before they affect performance or availability.

Position Description:
When not fighting fires, the team is responsible for fire prevention through monitoring, automation, self-healing and resiliency initiatives, destructive testing, and game day exercises. The incumbent in this role would demonstrate a strong focus on tactical operations, as well as large-scale production engineering and orchestration.

Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.
Incident Management - act in key response roles during major incidents (e.g. Sev0, Sev1), and participate in the technical review of each incident for problem management
Problem Management - populate and participate in Root Cause Analyses (RCAs) and hand them off to the Global Solutions team
Ensuring that work carried out by members of the team is executed in such a way as to align with the company’s internal compliance policy and directives
Being available to discuss and resolve technical issues and customer concerns with other technical staff as the need arises
Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth
Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required
Ability to operate in a fast-paced environment, troubleshoot sophisticated issues quickly, and balance multiple priorities effectively
Work to automate detection and resolution of recurring issues in the production environment
Gain a deep understanding of the application and its inner workings and be able to pinpoint code defects to speed up remediations

Basic Requirements:

Systems engineering experience in enterprise scale internet service engineering or related role
Experience with monitoring implementations and administration
Strong interpersonal skills (written and oral)
Past experience in Incident Management for customer facing applications
Experience in working in a 24/7 team

Preferred Qualifications:

Python/Bash/Go scripting experience
Prior automated deployment experience
Prior experience with monitoring and alerting systems
Experience troubleshooting relational databases and distributed platforms
Experience in maintaining Java and Go applications
Experience in Docker orchestration and management
Experience with Kubernetes
Hands-on experience configuring and running AWS (Amazon Web Services), using the CLI/SDKs
Experience running systems monitoring and alerts
Experience with JVM optimization and Java server technologies like Tomcat or Jetty
