Software Developer / Engineer / Architect

Site Reliability Engineer

Rapid7 is a leading provider of security data and analytics solutions that enable organisations to implement an effective, analytics-driven approach to cyber security. We combine our extensive experience in security data and analytics and deep insight into attacker behaviours and techniques to make sense of the wealth of data available to organisations about their IT environments and users. 

Our Dublin Engineering group is responsible for providing log management services such as search, alerting and data visualisation to security professionals. Our systems ingest large amounts of data and need to be highly available and performant at all times.

We are looking for a talented Site Reliability Engineer (SRE) with a deep interest in distributed systems, cloud computing and the architecture of large-scale systems. The SRE will be part of a team that ensures our Log Management services have the reliability and uptime appropriate to our users' needs. You will work with other engineering teams to help solve extremely challenging problems.

Some of the technologies we use include:

Java, Python, Terraform, Jenkins, Artifactory, Chef, Puppet, Ansible, ZooKeeper, Docker, AWS (EC2, S3, CloudFormation, etc.), Cassandra, PostgreSQL, Kafka

Responsibilities:

  • Working closely with the SRE Lead and the Engineering, Architecture, Infrastructure and Product teams to improve the full lifecycle of the Log Management services, from inception and design through deployment, operations, monitoring, security, upgrades and maintenance
  • Supporting services before they go live through activities such as design, deployment, migration strategy, monitoring, and playbook reviews
  • Maintaining services once they are live by measuring and monitoring availability, latency, and overall system health
  • Scaling systems through automation, service and infrastructure improvements, and other means
  • Troubleshooting production issues and liaising with the relevant Engineering or Infrastructure team to reach a resolution
  • Participating in on-call support and in incident response follow-ups such as post-mortems

Skills and Understanding:

  • Previous experience in an SRE role
  • 5+ years of experience scaling SaaS services and infrastructure
  • Solid experience of developing, scaling, deploying and troubleshooting large-scale systems
  • Solid understanding of deployment and monitoring frameworks
  • Ability to debug and optimise code and to automate routine tasks
  • Advanced understanding of system performance and tuning
  • Excellent knowledge of NoSQL concepts
  • Excellent knowledge of OOP languages such as Java
  • Excellent knowledge of scripting languages such as Python
  • Experience with algorithms and data structures
  • Excellent knowledge of RESTful architectures
  • Understanding of Unix/Linux operating systems
  • Proficiency in AWS services such as EC2, RDS, VPC, networking, S3 and MSK
  • Systematic problem-solving approach
  • Excellent communication & influencing skills
  • Excellent technical writing skills