Job Summary

As a Sr. Site Reliability Engineer, you will operate seamlessly between development and operations. You will engage in and improve the lifecycle of cloud services, from design through deployment, operation, and refinement. You will maintain services by measuring and monitoring availability, latency, and overall system health. You will play a key role in scaling systems sustainably through automation and in evolving them by pushing for changes that improve reliability and velocity. You will administer cloud-based environments that support our SaaS (Software as a Service) / IaaS (Infrastructure as a Service) offerings implemented on a microservices, container-based architecture (Kubernetes). To be successful in this role, you must be a motivated self-starter and self-learner, possess strong problem-solving skills, and embrace challenges.

Key Responsibilities

Manage production environments by monitoring availability and taking a holistic view of platform and product health.
Build software and systems to manage platform infrastructure and applications.
Identify stability and reliability issues in product code and devise strategies to resolve them.
Mentor Site Reliability Engineering (SRE) engineers and coach an automation-first mindset.
Partner with development teams to improve services through rigorous testing and release procedures.
Balance accelerating infrastructure feature delivery against a well-deserved pause to fix issues.
Debug and troubleshoot service bottlenecks throughout the whole software stack.
Measure and monitor availability, latency, and overall system health. Develop and improve instrumentation for monitoring and logging the health and availability of services.
Conduct CI/CD operations to deploy an assortment of software deliverables across a global production environment.
Provide architectural guidance to optimize the observability stack across NetApp's cloud services.
Be hands-on in the implementation of our observability stack. You have driven the deployment of these tools at scale and have experience working with a rapidly growing infrastructure.
Build dashboards that provide insight into critical business metrics for a variety of audiences across engineering and SRE teams.

Job Requirements

At least 8 years of experience is required.
Experience writing, troubleshooting, and fixing bugs in product code.
Scripting and infrastructure automation using, for example, Ansible, Python, Go, Perl, or Ruby.
Deep working knowledge of containers, Kubernetes, and serverless computing implementations.
Understanding of the software development lifecycle (SDLC) and DevOps development methodologies.
Experience with one of the three major hyperscalers (AWS, Azure, or GCP).
Experience defining, applying, and managing SLAs, SLOs, and SLIs for the product.
Good interpersonal communication and customer service skills are needed to work successfully with stakeholders in high-stress and/or ambiguous situations.
This role sometimes includes on-call work and travel.

Education

A Bachelor of Science degree in Computer Science, a master's degree, or equivalent experience is required.

Compensation

The target salary range for this position is 152,150 - 216,590 USD. The salary offered will be determined by the candidate's location, qualifications, experience, and education and may be outside of this range. Final compensation packages are competitive and in line with industry standards, reflecting a variety of factors, and include a comprehensive benefits package. This may cover Health Insurance, Life Insurance, Retirement or Pension Plans, Paid Time Off (PTO), various Leave options, Performance-Based Incentives, an employee stock purchase plan, and/or restricted stock units (RSUs), with all offerings subject to regional variations and governed by local laws, regulations, and company policies. Benefits may vary by country and region, and further details will be provided as part of the recruitment process.

Nearest Major Market: Durham
Nearest Secondary Market: Raleigh
Job Segment: Cloud, Testing, Computer Science, Engineer, Technology, Engineering, Research

Sr. Site Reliability Engineer