
Reliability Engineer – Incident Management

Groupon
Dublin
November 30, 2021


Groupon’s mission is to become the daily habit in local commerce and fulfill our purpose of building strong communities through thriving small businesses by connecting people to a vibrant, global marketplace for local services, experiences and goods. In the process, we’re positively impacting the lives of millions of customers and merchants globally. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking and celebrates success. If you want to take more ownership of your career, then you're ready to be part of Groupon.

Are you a passionate, energetic technology enthusiast, eager to work at a rapid pace and flexible enough to work across our suite of technologies? Are you a problem solver, someone who enjoys debugging infrastructure platforms, resolving issues, and creating solutions for common problems? Do you get a little obsessed with the details?

We are looking for a Reliability Engineer (Incident Management) to join our team to support and optimise the process, implementation, and operational support of internal systems that span business-side and engineering departments.

We're a "best of both worlds" kind of company. We're big enough to have resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact. We're curious, fun, a little intense, and kind of obsessed with helping local businesses thrive. Does that sound like a compelling place to work?

Our infrastructure ecosystem:

Heterogeneous ecosystem including Linux, FreeBSD, MySQL, Postgres, KVM virtualisation
Docker and Kubernetes
Netscaler Load Balancers and Juniper Network Infrastructure
Splunk and Elasticsearch
AWS Environment
Nagios, Pingdom, PagerDuty, and Wavefront monitoring tools
GitHub and JIRA

You’ll spend time on the following:

You will leverage Site Reliability Engineering best practices and the ITIL Solutions Architecture framework to devise incident management strategies.
You are an Incident Commander, change manager, and a senior technical resource responsible for preventing, identifying, triaging, documenting, investigating, mitigating, and recovering from site/service impacting incidents across Groupon’s 600+ globally dispersed services.
You will assess, approve, and schedule risky changes, load testing, and maintenance windows.
You will facilitate the coordination and resolution of Post Mortems through best practices and oversee Problem Management.

We value engineers who are:

Customer-focused: We believe that doing what’s right for the customer is ultimately what will drive our business forward.
Team players: You believe that more can be achieved together. You listen to feedback and also provide supportive feedback to help others grow and improve.
Fast learners: We are willing to disrupt our existing business to trial new products and solutions. You love learning how to use new technologies and then rapidly apply them to new problems.
Pragmatic: We do things quickly to learn what our customers desire. You know when it’s appropriate to take shortcuts that don’t sacrifice quality or maintainability.
Owners: Engineers at Groupon know how to positively impact the business.

We’re excited about you if you have:

5+ years administering Linux system environments and completing root cause analysis of site-impacting issues.
3+ years troubleshooting, updating, and administering virtualisation tools, ideally KVM and container tools.
3+ years of experience creating unique search queries to identify, resolve, and prevent incidents and outages, plus experience owning all impacting events until resolution, including coordinating with Subject Matter Experts, triaging tasks, creating all associated documentation, and completing action items and the Post Mortem.
3+ years of experience developing policies and procedures that improve overall production stability.
Good communication, consulting, and collaboration skills for interfacing with senior leadership teams, and the ability to work on-call one weekend out of every six.
A plus if you have a BS, MS or PhD in Computer Science or a related field.
A plus if you have designed and created tools to manage the site and services.
