Software Developer/ Engineer/ Architect

Big Data Reliability Engineering Manager

Job Description

The Big Data Reliability Engineering Manager is capable of working from abstract and general direction to develop multi-layered solutions that encompass and specify business, application, data and infrastructure designs. This is a leadership role responsible for setting direction at an organizational level across multiple technologies or domains. The role also requires the ability to lead global teams through formal and informal relationships to drive platform reliability, innovation and operational excellence. The manager leads a team and is responsible for staffing, communication, training and development, directing and prioritizing work, evaluating performance and removing roadblocks.

The Big Data Reliability Engineering Manager is accountable for working with platform delivery teams to develop robust big data and HPC platform designs that meet the criteria for functionality, stability, performance and resiliency, so that they deliver effective IT solutions for our business. These solutions may encompass numerous standard and emerging technologies and involve sophisticated designs requiring collaboration with senior engineering staff from across the company and from external vendors.

Responsibilities include:

  • Driving reliability improvement and toil elimination across multiple enterprise-scale big data and high-performance compute platforms.
  • Ensuring IT standards are met, production readiness processes are followed, and a quality system product is delivered and maintained.
  • Managing and escalating real-time incidents from identification through service restoration.
  • Advising junior staff on appropriate steps to diagnose, repair or remediate issues with systems in the assigned domain.
  • Developing detailed analyses of the causes of issues, along with short- and long-term plans to address them.
  • Building a diverse and effective team by identifying and selecting the best talent across multiple locations.
  • Directing and prioritizing work, managing performance, and providing guidance and coaching to team members.
  • Managing the culture within the team and holding themselves and others accountable for demonstrating GM's values and cultural behaviors.
  • Performing other related duties as assigned.

Additional Job Description

Specific requirements include the following:

  • Advanced knowledge of and experience with the principles of site reliability and their implementation, including the concepts of SLIs/SLOs, toil elimination, error budgets, risk management, over-engineering, and monitoring.
  • Understanding of big data and high-performance compute platforms and technologies, including open-source tools and platforms as well as selected commercial products in the areas of virtualization, containerization, workload schedulers, advanced storage technologies, distributed systems, and large-scale cluster management.
  • Expertise in technical change management and incident management practices, including initial triage, root cause identification, impact analysis, service restoration techniques, and post-incident problem management.
  • Demonstrated expertise and experience with large-scale monitoring tools and solutions, including implementation of monitoring systems and automation for self-healing and correction of detected issues.
  • Extensive experience operating within an agile delivery model, specifically with Scrum principles and practices, including backlog grooming and management, use of burn-down/burn-up tools, story point estimation, Scrum ceremonies, and trade-off analysis and evaluation.
  • Expert, management-level knowledge of various big data and high-performance compute technologies and their uses in the assigned domain.
  • Knowledge of site reliability concepts, principles, and practices and their application in a large enterprise context.
  • Robust knowledge of general IT concepts and system design principles, including reliability, availability, and scalability.
  • Extensive knowledge of how to leverage available resources to solve problems and deliver solutions.