We are seeking a highly motivated Site Reliability Engineer (SRE) to join our Applications Infrastructure organization. This team automates, deploys, and maintains the cloud-hosted infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva. The SRE role focuses on keeping production healthy and preventing outages by defining and developing robust software engineering solutions and practices. These efforts simplify the operating environment, enhance the reliability of NVIDIA cloud services, and expedite feature rollouts.

What You'll Be Doing:

Develop and integrate new software, tools, and analytics to improve the availability, scalability, latency, and efficiency of our cloud services.
Manage upgrades and automated rollbacks across all clusters.
Maintain Service Level Agreements (SLAs) by collaborating with developers to define Service Level Indicators (SLIs) and design stable, secure services.
Guide the Change Advisory Board and Root Cause Corrective Action (RCCA) processes.
Collaborate with engineering, DevOps, and product leads across the GPU cloud services stack to build fast, reliable, and durable production systems.
Drive process changes to enhance the reliability and performance of cloud services.
Debug production issues across services and levels of the stack.
Improve operational processes.

What We Need to See:

Bachelor's degree in Computer Science or a related field, or equivalent experience.
5+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance tuning, and application issue resolution.
5+ years of experience in authoring and debugging software written in C++ and Python.
Hands-on experience with Kubernetes-based cloud environments.
Multi-cloud experience.
Experience working with partners across multiple teams.
Experience operating production systems.

Ways to Stand Out from the Crowd:

Background with Software as a Service (SaaS) offerings.
Experience with application troubleshooting, algorithms, and data structures.

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our engineering teams are growing rapidly. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you.

The base salary range is 140,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.