NVIDIA seeks a data center infrastructure optimization and resiliency team manager to join its infrastructure specialist team. Academic and commercial groups worldwide use NVIDIA products to redefine deep learning and data analytics and to power data centers. Join the team building many of the world's largest and fastest data centers! NVIDIA is looking for someone who can lead a customer team responsible for production AI infrastructure and workflow optimization. The work includes solving complex, customer-focused operational optimization problems; planning, facilitating, and executing continuous improvement events using NVIDIA telemetry tools; and interfacing with company stakeholders, which requires excellent interpersonal skills. This role will involve interacting with customers, partners, and internal teams to analyze, define, and implement large-scale data center infrastructure optimization. These efforts call for practical leadership experience with data center team systems, networks, cloud operation and orchestration, AI workload resiliency, and performance optimization, along with assurance of continual and efficient planning, operation, validation services, and team performance.
What you will be doing:
- Manage regional, customer-dedicated teams focused on optimizing customer infrastructure and enhancing resiliency.
- Lead a team that inspects and observes infrastructure and AI workloads to ensure system health and performance.
- Establish and refine optimization workflows, collaborate with customers and analytics partners, and analyze results to improve AI workload production processes.
- Work closely with customers and NVIDIA teams to prioritize, frame, and implement system improvements related to customer health and operational process evolution.
- Partner with development, tools, and support teams to optimize GPU and infrastructure utilization, ensuring efficient capacity consumption.
- Offer technical guidance and oversight for systems and networking activities. Serve as the primary manager across all initiatives, allocate team schedules, prioritize tasks, and provide feedback and direction on complex technical issues.
- Work closely with the customer IT infrastructure teams to design and implement data center network changes, accommodating new and changing requirements.
- Ensure deployment risks are minimized across regional activities to maintain operational integrity.
- Establish, supervise, and continuously improve processes to prevent infrastructure, systems, and services failures.
- Guarantee that all tasks are completed with high quality, avoiding negative impacts on internal/external users and business operations.
- Plan and implement ongoing improvements to deployment methodologies for greater effectiveness and efficiency.
- Ensure all operational KPIs and metrics are tracked and met. Maintain a strong commitment to service quality and user experience, striving for continuous improvement.
- Keep informed of developments in NVIDIA products, particularly in data center facilities, systems, and networking, and provide recommendations to address current and future needs.
- Ensure that documentation, policies, procedures, and guidelines are in place for systems, resources, and activities.
- Supervise, evaluate, mentor, and coach team members, fostering a culture of continuous learning and professional growth.
What we need to see:
- 10+ years of overall, demonstrable experience in service operations management in enterprise-level data centers, with continual infrastructure and service improvement.
- 7+ years of people management experience.
- Certification related to data centers, servers, and networks (preferred).
- Bachelor's degree or equivalent experience.
- In-depth practical knowledge of and experience with data center environments, servers, network equipment, operations, and services.
- Extensive experience in installing, monitoring, and maintaining data center equipment.
- Analytical Attitude & Problem Solving - able to analyze information, problems, situations, practices, and procedures; collect and interpret data; reason logically; establish facts; identify and define existing and potential issues; recognize the interrelationships among elements; draw valid conclusions; develop recommendations and alternative courses of action; select an appropriate course; follow up; and evaluate the results.
- Exceptional ability to work as part of a team, provide IT support, and resolve errors.
- Organization & Time Management - able to plan, schedule, and organize tasks related to the job to achieve goals within or ahead of established time frames.
- Willingness to travel (25%).
Ways to stand out from the crowd:
- Experience with data center operations processes, safety, and security measures.
- Knowledge of data center infrastructure.
- Outstanding social skills.